Handling the small-files problem in Hadoop

According to the Apache Hadoop site, Hadoop is a framework that allows for the distributed processing of very large data sets across clusters of computers using simple programming models. In the real world, however, we often encounter scenarios where a large amount of information is stored as many small files. Consider a scenario where we need to process a million files, each about 1 MB in size. This is a problem because every file, directory, and block consumes memory in the NameNode, and MapReduce typically launches one map task per small file, so both memory usage and job startup overhead grow with the file count.

To handle this, Hadoop provides two solutions: HAR files (Hadoop Archives) and Sequence Files. Each has its own trade-offs.

HAR files are created by issuing a hadoop archive command, which runs a MapReduce job to pack the files being archived into a small number of HDFS files. HAR files are less effective at improving the performance of processing small files in HDFS and are more useful for archival purposes, since their main benefit is reducing NameNode memory pressure.
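As a sketch, creating and inspecting an archive looks like the following (the directory paths and archive name here are made-up examples):

```shell
# Pack everything under /user/demo/input into a single archive.
# This launches a MapReduce job, so it needs a running cluster.
hadoop archive -archiveName files.har -p /user/demo input /user/demo/output

# The archived files remain addressable through the har:// scheme:
hadoop fs -ls har:///user/demo/output/files.har
```

Note that reading through `har://` still processes each original small file individually, which is why HARs do not speed up MapReduce jobs.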

The other option Hadoop provides is Sequence Files, in which we use the file name as the key and the file's contents as the value. Here we write a program that packs all of our small files into one sequence file, and then run our MapReduce job against that single file.
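The real implementation would use Hadoop's Java SequenceFile.Writer API; as a language-neutral illustration of the idea, here is a minimal Python sketch that packs many small files into one container of length-prefixed (file name, contents) records and reads them back. The record format is invented for this example and is not the actual SequenceFile binary format:

```python
import struct

def pack(filenames, out_path):
    """Append (name, contents) records to one container file.

    Length-prefixed records stand in for the key/value pairs a real
    SequenceFile.Writer would produce; key = file name, value = bytes.
    """
    with open(out_path, "wb") as out:
        for name in filenames:
            with open(name, "rb") as f:
                data = f.read()
            key = name.encode("utf-8")
            out.write(struct.pack(">I", len(key)))   # key length
            out.write(key)                           # key: file name
            out.write(struct.pack(">I", len(data)))  # value length
            out.write(data)                          # value: file contents

def unpack(path):
    """Yield (name, contents) pairs back out of the container."""
    with open(path, "rb") as f:
        while True:
            head = f.read(4)
            if not head:          # clean end of file
                return
            (klen,) = struct.unpack(">I", head)
            key = f.read(klen).decode("utf-8")
            (vlen,) = struct.unpack(">I", f.read(4))
            yield key, f.read(vlen)
```

A mapper can then iterate over the records of the single packed file instead of opening a million tiny files, which is exactly the access pattern a sequence file enables.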

