Handling the many-small-files problem in Hadoop

According to the Apache Hadoop site, Hadoop is a framework that allows for the distributed processing of very large data sets across clusters of computers using simple programming models. In the real world, however, we often encounter large amounts of data stored as many small files. Consider a scenario where we need to process a million files, each 1 MB in size. This is a problem for Hadoop: every file, block, and directory is tracked in the NameNode's memory, and MapReduce typically launches one map task per file, so a million small files means heavy NameNode overhead and a million short-lived tasks.

To handle this, Hadoop provides two solutions, HAR files (Hadoop Archives) and SequenceFiles, each with its own trade-offs.

HAR files are created by issuing a hadoop archive command, which runs a MapReduce job to pack the files being archived into a small number of HDFS files. HAR files do little to improve the performance of processing small files, since each archived file must still be read individually, and are more useful for archival purposes.
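For illustration, creating and listing a HAR looks like the following. These commands need a running Hadoop cluster, and the paths and archive name here are hypothetical:

```shell
# Pack everything under /user/me/input into one archive (runs a MapReduce job).
# -p sets the parent path that archived file names are resolved against.
hadoop archive -archiveName small.har -p /user/me/input /user/me/archives

# The archive is exposed as a file system of its own via the har:// scheme.
hadoop fs -ls har:///user/me/archives/small.har
```

MapReduce jobs can read input through the har:// URI, but each archived file is still processed as a separate (small) input, which is why HARs help storage more than processing.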

The other option Hadoop provides is SequenceFiles, in which we use the file name as the key and the file contents as the value. Here, we write a program that packs all our small files into one sequence file, then let our MapReduce job process that single, splittable file.
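As a sketch of such a program (the class name and argument handling are illustrative, and it must be run against a Hadoop installation), Hadoop's SequenceFile.Writer can be used with Text keys for file names and BytesWritable values for file contents:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilesToSequenceFile {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inputDir = new Path(args[0]);   // directory containing the small files
    Path outputFile = new Path(args[1]); // the single sequence file to produce

    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(outputFile),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class));
    try {
      for (FileStatus status : fs.listStatus(inputDir)) {
        byte[] contents = new byte[(int) status.getLen()];
        IOUtils.readFully(fs.open(status.getPath()), contents, 0, contents.length);
        // file name as the key, raw file bytes as the value
        writer.append(new Text(status.getPath().getName()),
                      new BytesWritable(contents));
      }
    } finally {
      writer.close();
    }
  }
}
```

A downstream MapReduce job would then use SequenceFileInputFormat to read the packed records back out as (file name, contents) pairs.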


Configuring Rack Awareness in Hadoop

Hadoop divides data into blocks and stores them on different machines. If rack awareness is not configured, Hadoop may place all replicas of a block in the same rack, which results in data loss when that rack fails.

Although rack failure is rarer than node failure, this risk can be avoided by explicitly configuring rack awareness in core-site.xml.

Rack awareness is configured using the property "topology.script.file.name" in core-site.xml. (In Hadoop 2.x and later, the property is named "net.topology.script.file.name".)

If "topology.script.file.name" is not configured, /default-rack is returned for every IP address, i.e., all nodes are treated as being on the same rack.

Configuring rack awareness in Hadoop involves two steps:

  1. Configure the "topology.script.file.name" property in core-site.xml:

    <property>
      <name>topology.script.file.name</name>
      <value>core/rack-awareness.sh</value>
    </property>

  2. Implement the rack-awareness.sh script as desired. A sample rack-awareness script follows below.

Sample 1: script with a data file

Topology Script

A sample Bash shell script:

HADOOP_CONF=/etc/hadoop/conf

# For each node name or IP passed in by Hadoop, look up its rack
# in topology.data and print the rack path.
while [ $# -gt 0 ] ; do
  nodeArg=$1
  exec < ${HADOOP_CONF}/topology.data
  result=""
  while read line ; do
    ar=( $line )
    if [ "${ar[0]}" = "$nodeArg" ] ; then
      result="${ar[1]}"
    fi
  done
  shift
  # Fall back to the default rack when the node is not listed.
  if [ -z "$result" ] ; then
    echo -n "/default/rack "
  else
    echo -n "$result "
  fi
done

Topology data

hadoopdata1.ec.com     /dc1/rack1
hadoopdata1            /dc1/rack1
10.1.1.1               /dc1/rack2

References:

http://wiki.apache.org/hadoop/topology_rack_awareness_scripts

https://issues.apache.org/jira/secure/attachment/12345251/Rack_aware_HDFS_proposal.pdf
