Amazon S3 Select and Amazon Athena both allow you to perform SQL style queries against your data in S3. But whats the difference between these two options and when should you use one over another? Find out in this article.
Leveraging S3 to store raw data is a common strategy for many companies. Often though, we’d like to analyze our dataset without having to load it into a traditional database first – doing so wastes both time and money.
S3 Select and Athena are two alternatives that allow you to perform SQL operations on your data in S3 without having to move or transform your database. However both are intended for two very different purposes.
In this article, we’re going to first discuss what S3 Select is, what Athena is, and when to use one over the other.
So let’s get started…
What is S3 Select?
Amazon S3 Select is new feature that allows you to perform simple SQL operations against your raw data stored in S3. One of the requirements is that your data needs to be in a structured format, i.e. JSON, CSV, Parquet. Do note that it will also work on compressed version of files so you don’t need to decompress them before reading.
S3 Select is a completely serverless solution. You don’t provision any servers or databases to make this run.
One of the big limits though is that S3 Select only supports the SELECT clause in SQL. That means no joins, no groupings, and no other sophisticated SQL operations. In your SELECT statement, only the following other clauses are supported:
- FROM
- WHERE
- LIMIT
Also note that nested json fields can also be accessed, so if you have deeply nested objects you should still be good to go.
The other, most important constraint of S3 SELECT is that you can only perform a query on one object at at time. This makes S3 select not a suitable option for many use cases requiring parsing of entire bucket paths or collections of objects.
In terms of pricing, S3 SELECT is priced on a couple different dimensions, also outlined below:
- Number of SELECT requests ($.0004 per 1000 requests)
- Data Scanned ($.0002 per GB)
- Data Returned ($.00007 per GB)
Fore more info on the pricing model check out the link here.
All in all, I would consider S3 select to be an easy way to extract specific portions of your data stored in S3 using SQL without having to retrieve the whole object. This can be integrated at runtime your applications.
What is Amazon Athena?
Amazon Athena is a big data query service that allows you to easy analyze large volumes of data without having to provision servers or databases. Like S3 Select, Athena is also serverless and is based on SQL. But the main distinction between the two is the scale in which Athena lets you perform your queries.
Like we learned with S3 Select, it only supports querying one file at a time. With Amazon Athena, we can perform SQL against any number of objects, or even entire bucket paths. Its not uncommon for folks to perform queries on TERABYTES of data.
Behind the scenes, Athena uses an EMR cluster. When performing a query, Athena submits it as a job, and the Athena service works through the queries on a first come first serve basis. This means that at times, you can expect significant delays in your query’s response time.
Another important note is that queries are asynchronous. In other words, you can submit a query, but you won’t necessarily get an answer right away. When the query is completed, Athena will deliver the result set a S3 bucket of your choice, and if using the console, print out the results.
This means that Athena is not suitable for real time applications, and much better suited for Analytics purposes where latency is likely less of a concern.
Athena also requires you to set up a ‘database’ using AWS Glue. This isn’t a traditional database, but it basically just holds the metadata about your data locations and schema that will be traversed.
So Whats the Difference Between S3 Select and Athena?
S3 Select is a lightweight solution designed to let you use SQL to perform simple SELECT clauses on a maximum of one file. Amazon Athena is an analytics workhorse that allows you to perform SQL on extremely large datasets spanning many files with great performance.
I hope this article clarified the difference between these two options.