AWS Glue is a managed extract, transform, and load (ETL) service that can process data stored in S3 or DynamoDB and convert it into different formats or schemas for easier use in other services like Athena.
Why Use AWS Glue?
AWS Glue is intended for anyone with more data than they can easily process. Perhaps you have a whole fleet of servers, each one spitting out log files. You ingest this data into S3 for easy storage, but there’s a lot of it, and it needs processing before it can be analyzed with Athena. Maybe you’re only interested in a few columns of the data and want to discard the rest.
AWS Glue can handle that; it sits between your S3 data and Athena, and processes data much as a utility like sed or awk would on the command line. By setting up a crawler, you can import data stored in S3 into your Data Catalog, the same catalog Athena uses to run queries. You can then modify this data to remove unnecessary columns or convert between formats.
AWS Glue can also automatically convert CSV and other delimited formats into the Apache Parquet columnar format. Parquet is highly recommended for anyone working with Athena: because queries only scan the columns they touch, it can cut your costs by an order of magnitude.
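To give a flavor of what that looks like in practice, here’s a minimal sketch of the core of a Glue ETL script (written in Python, as Glue scripts are) that reads a crawled table, drops a couple of columns, and writes the result back out as Parquet. The database, table, column, and bucket names here are placeholders, not anything Glue creates for you:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the table the crawler created in the Data Catalog.
# "default" and "server_logs" are placeholder names.
logs = glue_context.create_dynamic_frame.from_catalog(
    database="default", table_name="server_logs"
)

# Keep only the columns you care about by dropping the rest.
trimmed = logs.drop_fields(["user_agent", "referrer"])

# Write the result back to S3 as Parquet for cheaper Athena queries.
glue_context.write_dynamic_frame.from_options(
    frame=trimmed,
    connection_type="s3",
    connection_options={"path": "s3://my-output-bucket/logs-parquet/"},
    format="parquet",
)

The walkthrough below shows how to have Glue generate a script like this for you, so you rarely need to write it from scratch.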
How to Get Started
Head on over to the AWS Glue Console, and select “Get Started.” From the “Crawlers” tab, select “Create Crawler,” and give it a name. Choose “Data Stores” as the import type, and configure it to import data from the S3 bucket where your data is being held.
Next, create a new IAM role for the crawler to assume. You can create it from this dialog, then select it in the list (you may have to hit the refresh button next to the list).
You can give your crawler a schedule using AWS’s cron-style syntax, or by selecting one of the predefined options. You can also run it manually from the console if you’d like.
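If you’d rather script this setup than click through the console, the same crawler can be created with boto3. A minimal sketch, with placeholder bucket, role, and database names (note that AWS schedules use a six-field, cron-style syntax rather than plain cron):

import boto3

glue = boto3.client("glue")

# All names here are placeholders; substitute your own bucket, role, and database.
glue.create_crawler(
    Name="server-log-crawler",
    Role="AWSGlueServiceRole-Logs",  # IAM role the crawler assumes
    DatabaseName="default",          # Data Catalog database to populate
    Targets={"S3Targets": [{"Path": "s3://my-log-bucket/logs/"}]},
    # Six-field AWS cron format: minute hour day-of-month month day-of-week year
    Schedule="cron(0 12 * * ? *)",   # daily at 12:00 UTC
)

# Start a run immediately instead of waiting for the schedule.
glue.start_crawler(Name="server-log-crawler")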
Choose an output database from your Data Catalog. If you’ve used Athena before, you may have a custom database, but if not, the default one should work fine. The crawler creates a table in this database to hold the schema it infers from your data.
Converting Data
Once the crawler has catalogued your data, you can use it in other AWS Glue functions. For example, if you want to process your data, you can create a new job from the “Jobs” tab to handle the conversion.
Give the job a name, and select your IAM role. Select “A Proposed Script Generated By AWS Glue” as the script the job runs, unless you want to manually write one.
From the next tab, select the table the crawler imported your data into. Click “Next,” and then select “Change Schema” as the transform type.
You can choose to create new files, or update the current ones with the new schema instead. If you’re converting to Parquet or another format, you’ll need to create new files.
The next page is where all the magic happens: each column in the source file is mapped to a column in the output file. You can delete columns and add new ones if you’d like. By default, it’s a one-to-one mapping, so if you’re just converting between formats, you can ignore this page.
Next, you’re brought to the script editor, where AWS has preloaded a script that executes the correct transform for you. You can run it manually from this tab in the console, or set it up with a trigger to run on a fixed schedule.
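For reference, a generated script follows a fairly standard skeleton. Here’s a trimmed-down sketch of that shape, using the placeholder names from earlier rather than anything Glue will actually emit for your data:

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the crawled table from the Data Catalog (placeholder names).
source = glue_context.create_dynamic_frame.from_catalog(
    database="default", table_name="server_logs"
)

# Each tuple mirrors a row on the mapping page:
# (source column, source type, target column, target type)
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("timestamp", "string", "timestamp", "string"),
        ("path", "string", "path", "string"),
        ("status", "long", "status", "int"),
    ],
)

# Write the transformed data out to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-output-bucket/logs-parquet/"},
    format="parquet",
)

job.commit()

If you’d rather skip the console for runs too, boto3’s start_job_run starts the job on demand, and create_trigger can attach a cron-style schedule to it.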
Athena can also be configured to load data from tables maintained by an AWS Glue crawler, rather than from a fixed path in S3. This also gives you finer control over what data gets imported.
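Once the job has written its output and a crawler has catalogued it, querying from Athena works like any other table in the Data Catalog. A minimal boto3 sketch, again with placeholder names:

import boto3

athena = boto3.client("athena")

# Table, database, and results bucket are placeholders.
athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) FROM server_logs GROUP BY status",
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)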