Moving Large Amounts of Data from Local Disk (non-HDFS) to Amazon S3

Fortunately, we have several tools at our disposal for moving data from local disks to Amazon S3:

1. Using the JetS3t Java Library
JetS3t is an open-source Java toolkit for developers to create powerful yet simple applications to interact with Amazon S3 or Amazon CloudFront. JetS3t provides low-level APIs but also comes with tools that let you work with Amazon S3.

One of the tools provided in the JetS3t toolkit is an application called Synchronize. Synchronize is a command-line application for synchronizing directories on your computer with an Amazon S3 bucket. It is ideal for performing backups or synchronizing files between different computers.
One of the benefits of Synchronize is its configuration flexibility: the number of upload threads and HTTP connections is set in a properties file, so you can open as many parallel uploads as your bandwidth and machine allow.

To set up Synchronize
1. Download JetS3t from the following URL: http://jets3t.s3.amazonaws.com/downloads.html.
2. Unzip jets3t.
3. Create a synchronize.properties file and add the following parameters, replacing the values for accesskey and secretkey with your AWS access key identifiers:
accesskey=xxx
secretkey=yyy
upload.transformed-files-batch-size=100
httpclient.max-connections=100
storage-service.admin-max-thread-count=100
storage-service.max-thread-count=10
threaded-service.max-thread-count=15
4. Run Synchronize using the following command line example:
bin/synchronize.sh -k UP somes3bucket/data /data/ --properties synchronize.properties
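
As a rough sketch of step 3, the properties file could be generated by a small shell function so that the keys do not have to be pasted in by hand. The function name write_sync_properties and the choice of reading the keys from the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables are assumptions made for this illustration; the property names and values are the ones listed above.

```shell
# Sketch only: write a synchronize.properties file for JetS3t Synchronize.
# Assumption: the keys are supplied through the AWS_ACCESS_KEY_ID and
# AWS_SECRET_ACCESS_KEY environment variables (a choice made for this example).
write_sync_properties() {
    local out
    out=${1:-synchronize.properties}
    if [ -z "${AWS_ACCESS_KEY_ID:-}" ] || [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]; then
        echo "Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY first" >&2
        return 1
    fi
    # The subshell keeps the restrictive umask from leaking into the caller.
    (
        umask 077   # a newly created file will be readable by its owner only
        cat > "$out" <<EOF
accesskey=${AWS_ACCESS_KEY_ID}
secretkey=${AWS_SECRET_ACCESS_KEY}
upload.transformed-files-batch-size=100
httpclient.max-connections=100
storage-service.admin-max-thread-count=100
storage-service.max-thread-count=10
threaded-service.max-thread-count=15
EOF
    )
}
```

After exporting the two variables, running write_sync_properties synchronize.properties produces the file that step 4 passes to Synchronize.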

2. GNU Parallel
GNU parallel is a shell tool that lets you use one or more computers to execute jobs in parallel. A job can be a single command or a small script that is run for each line of the input. Using GNU parallel, you can parallelize the upload of multiple files by running several upload processes simultaneously. In general, you should run as many parallel uploads as it takes to use most of the available bandwidth.

To use GNU parallel
1. Create a list of the files that you need to upload to Amazon S3, with their current full paths.
2. Run GNU parallel with any Amazon S3 upload/download tool and with as many parallel jobs as possible, using the following command line example:
ls | parallel -j0 s3cmd put {} s3://somes3bucket/dir1/
The previous example lists the contents of the current directory (ls) and pipes the file names to GNU parallel, which runs one s3cmd put command per file to upload it to Amazon S3. The -j0 option tells GNU parallel to run as many jobs at once as possible.
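
The two steps above can be sketched as a pair of shell functions. This is a minimal illustration, assuming GNU parallel and s3cmd are installed and configured; the function names list_files and upload_list, the file name filelist.txt, and the bucket path are placeholders chosen for this example.

```shell
# Sketch only: build a file list with full paths, then upload it in parallel.
# Assumes GNU parallel and s3cmd are installed; all names here are examples.

# Step 1: print the absolute path of every regular file under a directory,
# one per line (file names containing newlines are not handled).
list_files() {
    local dir
    dir=$(cd "$1" && pwd) || return 1
    find "$dir" -type f | sort
}

# Step 2: run one "s3cmd put" per listed file.
#   $1 = list file, $2 = S3 destination, $3 = number of parallel jobs
#   (defaults to 0, which GNU parallel treats as "as many as possible").
upload_list() {
    local list dest jobs
    list=$1
    dest=$2
    jobs=${3:-0}
    parallel -j "$jobs" -a "$list" s3cmd put {} "$dest"
}
```

For example, list_files /data > filelist.txt followed by upload_list filelist.txt s3://somes3bucket/dir1/ would upload every file under /data. Unlike ls, find also descends into subdirectories, so it is worth checking that files in different subdirectories do not share a name before sending them all to the same S3 prefix.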

3. Direct-to-S3
Aspera Direct-to-S3 offers a UDP-based file transfer protocol that moves large amounts of data at high speed directly to Amazon S3. If you have a large amount of data stored in your local data center and would like to move it to Amazon S3 for later processing on AWS (with Amazon EMR, for example), Aspera Direct-to-S3 can help move your data faster than TCP-based protocols such as HTTP, FTP, or SSH. For more information, see http://cloud.asperasoft.com/big-data-cloud/.

4. Using AWS Import/Export
AWS Import/Export accelerates moving large amounts of data into and out of AWS using portable storage devices for transport.
To use AWS Import/Export
1. Prepare a portable storage device from the list of supported devices. For more information, see Selecting Your Storage Device, http://aws.amazon.com/importexport/#supported_devices.
2. Submit a Create Job request to AWS that includes your Amazon S3 bucket, Amazon Elastic Block Store (EBS), or Amazon Glacier region, AWS access key ID, and return shipping address. In return, you receive a unique identifier for the job, a digital signature for authenticating your device, and an AWS address to which to ship your storage device.
3. Securely identify and authenticate your device. For Amazon S3, place the signature file on the root directory of your device. For Amazon EBS or Amazon Glacier, tape the signature barcode to the exterior of the device.
4. Ship your device, along with its interface connectors and power supply, to AWS.
