Moving Large Amounts of Data from HDFS (Data Center) to Amazon S3 using S3DistCp

A number of approaches are available for moving large amounts of data from your current storage to Amazon Simple Storage Service (Amazon S3) or from Amazon S3 to Amazon EMR and the Hadoop Distributed File System (HDFS). When doing so, however, it is critical to use the available data bandwidth strategically. With the proper optimizations, uploads of several terabytes a day may be possible. To achieve such high throughput, you can upload data into AWS in parallel from multiple clients, each using multithreading to provide concurrent uploads or employing multipart uploads for further parallelization.

Two Important tools to move data—S3DistCp and DistCp—can help you move data stored on your local (data center) HDFS storage to Amazon S3.

Using S3DistCp
S3DistCp is an extension of DistCp with optimizations to work with AWS, particularly Amazon S3. By adding S3DistCp as a step in a job flow, you can efficiently copy large amounts of data from Amazon S3 into HDFS where subsequent steps in your EMR clusters can process it. You can also use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3.
S3DistCp copies data using distributed map–reduce jobs, which is similar to DistCp. S3DistCp runs mappers to compile a list of files to copy to the destination. Once mappers finish compiling a list of files, the reducers perform the actual data copy. The main optimization that S3DistCp provides over DistCp is by having a reducer run multiple HTTP upload threads to upload the files in parallel.

To copy data from your Hadoop cluster to Amazon S3 using S3DistCp
The following is an example of how to run S3DistCp on your own Hadoop installation to copy data from HDFS to Amazon S3.
Tested on Version
Apache Hadoop 1.0.3 distribution and Amazon EMR AMI 2.4.1.

Using S3DistCp
1. Launch a small Amazon EMR cluster (a single node).
elastic-mapreduce –create –alive –instance-count 1 –instance-type m1.small –ami-version 2.4.1

2. Copy the following jars from Amazon EMR’s master node (/home/Hadoop/lib) to your local Hadoop master node under the /lib directory of your Hadoop installation path (For example: /usr/local/hadoop/lib). Depending on your Hadoop installation, you may or may not have these jars. The Apache Hadoop distribution does not contain these jars.
/home/hadoop/lib/emr-s3distcp-1.0.jar
/home/hadoop/lib/aws-java-sdk-1.3.26.jar
/home/hadoop/lib/guava-13.0.1.jar
/home/hadoop/lib/gson-2.1.jar
/home/hadoop/lib/EmrMetrics-1.0.jar
/home/hadoop/lib/protobuf-java-2.4.1.jar
/home/hadoop/lib/httpcore-4.1.jar
/home/hadoop/lib/httpclient-4.1.1.jar

3. Edit the core-site.xml file to insert your AWS credentials. Then copy the core-site.xml config file to all of your Hadoop cluster nodes. After copying the file, it is unnecessary to restart any services or daemons for the change to take effect.fs.s3.awsSecretAccessKey
YOUR_SECRETACCESSKEY

fs.s3.awsAccessKeyId
YOUR_ACCESSKEY fs.s3n.awsSecretAccessKey
YOUR_SECRETACCESSKEY fs.s3n.awsAccessKeyId
YOUR_ACCESSKEY

4. Run s3distcp using the following example (modify HDFS_PATH, YOUR_S3_BUCKET and PATH): hadoop jar /usr/local/hadoop/lib/emr-s3distcp-1.0.jar -libjars /usr/local/hadoop/lib/gson-2.1.jar,/usr/local/hadoop/lib/guava-13.0.1.jar,/usr/local/hadoop/lib/aws-java-sdk-1.3.26.jar,/usr/local/hadoop/lib/emr- s3distcp-1.0.jar,/usr/local/hadoop/lib/EmrMetrics-1.0.jar,/usr/local/hadoop/lib/protobuf-java-2.4.1.jar,/usr/local/hadoop/lib/httpcore-4.1.jar,/usr/local/hadoop/lib/httpclient-4.1.1.jar –src HDFS_PATH –dest s3://YOUR_S3_BUCKET/PATH/ –disableMultipartUpload

Using DistCp
DistCp (distributed copy) is a tool used for large inter- or intra-cluster copying of data. It uses Amazon EMR to effect its distribution, error handling, and recovery, as well as reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.

DistCp can copy data from HDFS to Amazon S3 in a distributed manner similar to S3DistCp; however, DistCp is not as fast. DistCp uses the following algorithm to compute the number of mappers required:
min (total_bytes / bytes.per.map, 20 * num_task_trackers)

If you are using DistCp and notice that the number of mappers used to copy your data is less than your cluster’s total mapper capacity, you may want to increase the number of mappers that DistCp uses to copy files by specifying the -m number_of_mappers option.

The following is an example of DistCp command copying /data directory on HDFS to a given Amazon S3 bucket:
hadoop distcp hdfs:///data/ s3n://awsaccesskey:awssecrectkey@somebucket/mydata/

Leave a Reply