Configuring Rack Awareness in Hadoop

Hadoop splits each file into blocks and stores replicas of every block on different machines. If rack awareness is not configured, Hadoop may place all replicas of a block on the same rack, and that data is lost if the rack fails.

Rack failures are less frequent than node failures, but this risk can be avoided by explicitly configuring rack awareness in core-site.xml.

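Rack awareness is configured using the property “topology.script.file.name” in core-site.xml. In Hadoop 2 and later the property is named “net.topology.script.file.name”, and the older name is deprecated.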

If “topology.script.file.name” is not configured, /default-rack is returned for every IP address, i.e., all nodes are treated as being on the same rack.

Configuring rack awareness in Hadoop involves two steps:

  1. Configure “topology.script.file.name” in core-site.xml:

    <property>
      <name>topology.script.file.name</name>
      <value>core/rack-awareness.sh</value>
    </property>

  2. Implement the rack-awareness.sh script as desired. Sample rack-awareness scripts can be found on the Hadoop wiki page listed in the references.

Sample 1: Script with datafile

Topology Script

A sample Bash shell script:

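#!/bin/bash
# Prints one rack path per host name or IP address passed as an argument.

HADOOP_CONF=/etc/hadoop/conf

while [ $# -gt 0 ] ; do
  nodeArg=$1
  result=""
  # Look the argument up in the data file (column 1: host, column 2: rack).
  while read -r line ; do
    ar=( $line )
    if [ "${ar[0]}" = "$nodeArg" ] ; then
      result="${ar[1]}"
    fi
  done < "${HADOOP_CONF}/topology.data"
  shift
  # Fall back to a default rack for hosts missing from the data file.
  if [ -z "$result" ] ; then
    echo -n "/default/rack "
  else
    echo -n "$result "
  fi
done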

Topology data

hadoopdata1.ec.com     /dc1/rack1
hadoopdata1            /dc1/rack1
10.1.1.1               /dc1/rack2
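
To see how the script and the data file fit together, the lookup can be tried outside of Hadoop. The following is only a sketch: `resolve_racks`, `demo_data`, and the address 10.9.9.9 are hypothetical names introduced for illustration, and the data file path is passed as a parameter so that nothing under /etc/hadoop/conf is touched. It assumes the same two-column data format as above.

```shell
#!/bin/bash
# Hypothetical helper: same lookup as the sample script, but the data file
# path is the first argument so it can be tried on a scratch file.
resolve_racks() {
  local data_file=$1
  shift
  local node host rack result
  for node in "$@"; do
    result=""
    # Column 1 is the host name or IP address, column 2 is the rack path.
    while read -r host rack _; do
      if [ "$host" = "$node" ]; then
        result=$rack
      fi
    done < "$data_file"
    # Unknown hosts fall back to the default rack used by the sample script.
    printf '%s ' "${result:-/default/rack}"
  done
}

# Scratch copy of the topology data shown above.
demo_data=$(mktemp)
cat > "$demo_data" <<'EOF'
hadoopdata1.ec.com     /dc1/rack1
hadoopdata1            /dc1/rack1
10.1.1.1               /dc1/rack2
EOF

# Several addresses can arrive in one call; one rack per argument is printed,
# in the same order as the arguments.
demo_output=$(resolve_racks "$demo_data" 10.1.1.1 hadoopdata1 10.9.9.9)
echo "$demo_output"
rm -f "$demo_data"
```

The first two arguments resolve to the racks listed in the data file, and the unlisted address falls back to /default/rack. Keeping the fallback at the same depth as the configured racks (two levels here) keeps the topology consistent.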

References

http://wiki.apache.org/hadoop/topology_rack_awareness_scripts

https://issues.apache.org/jira/secure/attachment/12345251/Rack_aware_HDFS_proposal.pdf
