28 January 2015

I recently had to analyze a big chunk of data and decided to use Impala on Amazon Web Services (AWS) Elastic MapReduce (EMR). EMR has a nice web UI for creating a cluster that you can then connect to and run your queries on. However, stepping through the UI takes a while, and I had to recreate the cluster every day to avoid paying for it 24/7. Since I had good experience accessing AWS APIs with boto, I wanted to script the cluster setup and my query execution. In addition, I wanted to write the results of the queries to CSV files and store them on S3.

As it turned out, this was a bit more complicated than I thought. That's why I will explain the steps in the remainder of this post.

  1. Define your AWS credentials, S3 bucket and key path for your CSV result file.
    Note: This is just an example. You should never store your AWS credentials in a source repository. Instead use environment variables or config files!
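A minimal sketch of this configuration might look as follows. The bucket name and key path are made-up placeholders, and the credentials are read from environment variables as recommended above:

```python
import os

# Read credentials from the environment rather than hard-coding them.
AWS_ACCESS_KEY_ID = os.environ.get("AWS_ACCESS_KEY_ID", "")
AWS_SECRET_ACCESS_KEY = os.environ.get("AWS_SECRET_ACCESS_KEY", "")

S3_BUCKET = "my-analysis-bucket"          # hypothetical bucket name
RESULT_KEY = "results/impala-result.csv"  # key path for the CSV result file
```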

  2. Establish a connection to this S3 bucket.
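With boto (version 2), connecting to a bucket could be sketched like this; the import is kept inside the function so the snippet can be read without AWS access:

```python
def connect_bucket(access_key, secret_key, bucket_name):
    """Open an S3 connection and return the named bucket (boto 2)."""
    import boto  # boto 2 S3 API
    conn = boto.connect_s3(access_key, secret_key)
    return conn.get_bucket(bucket_name)
```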

  3. Create a query file according to Impala SQL and upload it to S3.
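The query itself is plain Impala SQL; a hypothetical example query and an upload helper (boto 2) might look like this. The table and column names are assumptions for illustration:

```python
QUERY = """
-- hypothetical example query; adjust table and columns to your data
SELECT country, COUNT(*) AS visits
FROM logs
GROUP BY country
ORDER BY visits DESC;
"""

def upload_query(bucket, key_path, query):
    """Store the query text under the given key in the S3 bucket."""
    from boto.s3.key import Key  # boto 2
    k = Key(bucket)
    k.key = key_path
    k.set_contents_from_string(query)
```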

  4. Create a query runner script that executes the query on EMR and upload it to S3. The query file from the last step is passed as a parameter and downloaded from S3 to the local machine. Afterwards the query is executed and the result is written to a CSV formatted file. Finally, the result is uploaded to S3. To avoid an error if the file already exists, it is deleted beforehand.
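The runner itself is a bash script that is executed on the cluster's master node. One way to sketch it, embedded as a Python string together with an upload helper (boto 2); the local paths and the `hadoop fs` / `impala-shell` invocations are assumptions based on the EMR Impala setup of the time:

```python
QUERY_RUNNER = """#!/bin/bash
# $1: S3 path of the query file
# $2: S3 path for the CSV result

# download the query file from S3 to the local machine
hadoop fs -copyToLocal "$1" /home/hadoop/query.sql

# run the query; -B strips pretty-printing, --output_delimiter yields CSV
impala-shell -f /home/hadoop/query.sql -B --output_delimiter=',' \\
    -o /home/hadoop/result.csv

# delete any previous result to avoid an 'already exists' error, then upload
hadoop fs -rm -f "$2"
hadoop fs -copyFromLocal /home/hadoop/result.csv "$2"
"""

def upload_runner(bucket, key_path):
    """Store the runner script under the given key in the S3 bucket."""
    from boto.s3.key import Key  # boto 2
    k = Key(bucket)
    k.key = key_path
    k.set_contents_from_string(QUERY_RUNNER)
```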

  5. Generate an EMR step that executes the query runner script with the query file as parameter.
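On EMR, a shell script can be run as a step via the AWS-provided script-runner JAR. A sketch with boto 2's `JarStep`; the step and argument names are illustrative:

```python
def make_query_step(runner_s3_path, query_s3_path, result_s3_path):
    """Build an EMR step that runs the query runner script."""
    from boto.emr.step import JarStep  # boto 2
    return JarStep(
        name="Run Impala query",
        # AWS-provided helper JAR that executes an arbitrary script from S3
        jar="s3://elasticmapreduce/libs/script-runner/script-runner.jar",
        step_args=[runner_s3_path, query_s3_path, result_s3_path],
    )
```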

  6. Connect to EMR.
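Connecting to EMR with boto 2 is a one-liner; the region is an assumption and should match where your data lives:

```python
def connect_emr(access_key, secret_key, region="us-east-1"):
    """Open an EMR connection in the given region (boto 2)."""
    import boto.emr  # boto 2 EMR API
    return boto.emr.connect_to_region(
        region,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )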

  7. Start a new EMR cluster and execute the generated EMR step.
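Finally, the cluster can be started with `run_jobflow`, installing Impala via a bootstrap action and passing in the step from above. The bootstrap action path, AMI version, instance types, and log bucket below are assumptions, not definitive values:

```python
def run_cluster(emr_conn, step):
    """Start a throwaway EMR cluster with Impala and run the query step."""
    from boto.emr.bootstrap_action import BootstrapAction  # boto 2
    install_impala = BootstrapAction(
        "Install Impala",
        "s3://elasticmapreduce/libs/impala/setup-impala",  # assumed path
        ["--base-path", "s3://elasticmapreduce", "--impala-version", "latest"],
    )
    return emr_conn.run_jobflow(
        name="Impala query cluster",
        log_uri="s3://my-analysis-bucket/logs/",  # hypothetical log bucket
        ami_version="3.2.1",                      # an AMI that supports Impala
        master_instance_type="m1.large",
        slave_instance_type="m1.large",
        num_instances=3,
        bootstrap_actions=[install_impala],
        steps=[step],
        keep_alive=False,  # terminate the cluster once the step is done
    )
```

With `keep_alive=False` the cluster shuts itself down after the step finishes, which is exactly what you want when recreating it every day.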

You can find the entire script here.

Suggestions, comments, critique? Please drop me a note. Feedback is always appreciated!
