Search This Blog

Tuesday, July 17, 2012

Run Sqoop on Amazon Elastic MapReduce (EMR) with Amazon RDS

Amazon EMR doesn't have Sqoop installed.  It is possible to run Sqoop with Amazon EMR.  The following blog shows how to install and run Sqoop:

http://blog.kylemulka.com/2012/04/how-to-install-sqoop-on-amazon-elastic-map-reduce-emr/

However, the solution isn't perfect since the input files are usually in S3 and Sqoop doesn't support S3 directly.  Here is my script to install sqoop and export data from S3 to Amazon RDS (mysql):

#!/bin/bash
BUCKET_NAME=zli-emr-test
SQOOP_FOLDER=sqoop-1.4.1-incubating__hadoop-0.20
SQOOP_TAR=$SQOOP_FOLDER.tar.gz

##change to home directory
cd ~

##Install sqoop on emr
hadoop fs -copyToLocal s3n://$BUCKET_NAME/$SQOOP_TAR $SQOOP_TAR
tar -xzf $SQOOP_TAR

##Install jdbc driver (ex mysql-connection-java.jar) to sqoop lib folder
hadoop fs -copyToLocal s3n://$BUCKET_NAME/mysql-connector-java-5.1.19.jar ~/$SQOOP_FOLDER/lib/
##Copy input file from S3 to hdf
HADOOP_INPUT=hdfs:///user/hadoop/myinput
hadoop distcp s3://$BUCKET_NAME/myinput $HADOOP_INPUT
~/$SQOOP_FOLDER/bin/sqoop export --connect jdbc:mysql://RDS-Host-name:3306/DB_NAME --username USERNAME --password PASSWORD --table TABLE_NAME --export-dir $HADOOP_INPUT --input-fields-terminated-by='\t'
The script assumes that sqoop tar ball and mysql-connector-java.jar are in S3 bucket, as well as the input file are in S3 too.

Note that, RDS needs to be configured to allow access to the database with the following 2 EC2 security groups:
ElasticMapReduce-master
ElasticMapReduce-slave

2 comments:

  1. Thank you for your article
    Should one define HADOOP_INPUT or is it expected to be defined when the script runs? What should be the value?
    Thank you

    ReplyDelete
    Replies
    1. HADOOP_INPUT should be defined in the script. The format was messed up when I did copy and paste. I just fixed the format.

      Delete