According to the Apache Hadoop site, Hadoop is a framework that allows for the distributed processing of very large data sets across clusters of computers using simple programming models. In the real world, however, we often encounter scenarios where a large amount of data is spread across many small files. Consider a scenario where we need to process a million files, each of size 1 MB.
To handle this, Hadoop provides two solutions: HAR files (Hadoop Archives) and Sequence Files, each with its own trade-offs.
HAR files are created by issuing a hadoop archive command, which runs a MapReduce job that packs the files being archived into a small number of HDFS files. HAR files do little to improve the performance of processing small files in HDFS; they are more useful for archival purposes, since they reduce the NameNode memory pressure caused by tracking millions of individual files.
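For reference, creating a HAR file looks like the following. The paths here are hypothetical placeholders; substitute your own source directory and destination:

```shell
# Archive everything under /user/demo/smallfiles (hypothetical path)
# into smallfiles.har, placed in /user/demo/archives.
# -p sets the parent path relative to which files are archived.
hadoop archive -archiveName smallfiles.har -p /user/demo smallfiles /user/demo/archives

# The archive can then be listed through the har:// filesystem scheme:
hadoop fs -ls har:///user/demo/archives/smallfiles.har
```

Note that reading through a HAR still requires two index lookups per file access, which is one reason HARs rarely speed up MapReduce jobs over small files.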
The other option Hadoop provides is Sequence Files, which store data as key-value records: we use the file name as the key and the file's contents as the value. Here, we write a program that packs all our small files into one sequence file, and then let MapReduce process that single file.
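To make the key-value idea concrete, here is a minimal conceptual sketch in Python. It is not Hadoop's actual binary SequenceFile format (real jobs use the Java `org.apache.hadoop.io.SequenceFile.Writer` API); it only illustrates packing many small files into one container of (file name, contents) records:

```python
import io
import struct

def pack(files):
    """Pack {name: bytes} into one blob of length-prefixed (key, value) records."""
    out = io.BytesIO()
    for name, data in files.items():
        key = name.encode("utf-8")
        # Length-prefix the key and value so records can be scanned sequentially,
        # the same access pattern a MapReduce record reader relies on.
        out.write(struct.pack(">I", len(key)))
        out.write(key)
        out.write(struct.pack(">I", len(data)))
        out.write(data)
    return out.getvalue()

def unpack(blob):
    """Inverse of pack: yield (name, contents) records in order."""
    buf = io.BytesIO(blob)
    while True:
        header = buf.read(4)
        if not header:
            break
        key = buf.read(struct.unpack(">I", header)[0]).decode("utf-8")
        vlen = struct.unpack(">I", buf.read(4))[0]
        yield key, buf.read(vlen)

# Hypothetical small files: file name is the key, contents are the value.
small_files = {"a.txt": b"alpha", "b.txt": b"beta"}
blob = pack(small_files)
assert dict(unpack(blob)) == small_files
```

With one large container file instead of a million 1 MB files, the NameNode tracks a single entry and mappers read long sequential runs, which is exactly the benefit the real SequenceFile format (with splitting and compression support on top) delivers.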