Home » Cloud Computing

Hadoop tutorial

How to set up hadoop environment in a small cluster that contains four machines is introduced in this tutorial. It is not hard to extend this system to a larger cluster just by adding slave nodes.

1. Install needed packages

Sun Java JDK:

Downloaded form: http://java.sun.com/
For example, it is installed into /usr/java/jdk1.6.0_16 which may different depending on jdk‘s version.

Install JDK on ALL nodes of the Hadoop cluster.

Hadoop:

Downloaded from: http://hadoop.apache.org/

Unpack it to somewhere. We store it in /lhome/zma/hadoop in this example

For our convenience that we needn’t to specify the full path when calling hadoop:

$ export PATH=/lhome/zma/hadoop-0.20.2/bin/:$PATH

or

$ ln -s /lhome/zma/hadoop-0.20.2/bin/* /home/zma/bin/

2. Hosts files

Add lines like this to /etc/hosts on ALL nodes:

We use four machines here.

10.0.0.103  vm103
10.0.0.104  vm104
10.0.0.105  vm105
10.0.0.106  vm106

These host’s name will automatically change to the name described in the hosts file after rebooting.

Configure iptables:

We can configure iptables to allow all connections if these nodes are in a secure local area network by this command on all nodes:

# iptables -F
# service iptables save

3. Password-less SSH login

More details about password-less ssh login can be found here.

Generate private and public key:

On vm103 the master node:

$ ssh-keygen -t rsa
$ chmod 755 .ssh
$ cat .ssh/id_rsa.pub >> .ssh/authorized_keys
$ chmod 755 .ssh/authorized_keys

Copy .ssh directory to all nodes’ home directories:

$ for ((i=103; i<=106; i++)); do scp -r .ssh vm$i:~/; done;

Try to ssh to all these nodes:

$ for ((i=103; i<=106; i++)); do ssh vm$i ; done;

4. Hadoop Configurations

hadoop/conf/hadoop-env.h

Add or change these lines depending on your configuration:

export JAVA_HOME=/usr/java/jdk1.6.0_16
export HADOOP_HOME=/lhome/zma/hadoop-0.20.2

hadoop/conf/core-site.xml

Add these lines:
Here the namenode is vm103.

  <property>
    <name>fs.default.name</name>
    <value>hdfs://vm103:9000</value>
  </property>

hadoop/conf/hdfs-site.xml

Add these lines:

  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>

hadoop/conf/mapred-site.xml

Add these lines:
Here jobtracker is vm103.

<property>
    <name>mapred.job.tracker</name>
    <value>vm103:9001</value>
</property>

hadoop/conf/masters

Delete localhost and add:

vm103

hadoop/conf/slaves

Delete localhost and add:

vm104
vm105
vm106

5. Duplicate Hadoop to all nodes

Here we are on vm103:

$ for ((i=104; i<=106; i++)); \
do scp -r /lhome/zma/hadoop vm$i:/lhome/zma/ ; \
done;

6. Start Hadoop

We need to start both the HDFS and MapReduce to start Hadoop.

On vm103 (namenode):

Format a new distributed-filesystem HDFS:

$ hadoop namenode -format

Remember to delete HDFS’s local files on all nodes before reformat it:

$ rm /lhome/zma/hadoop/logs /tmp/hadoop-zma -rf

Start HDFS:

$ start-dfs.sh

Check the HDFS status:

$ hadoop dfsadmin -report

There may be less nodes listed in the report than we actually have. We can try it again.

Start mpared:

On vm103 (jobtracker):

$ start-mapred.sh

Check job status:

$ hadoop job -list

7. Enjoy the fun now

A simple example

Copy the input files into the distributed filesystem:

$ hadoop fs -put hadoop-0.20.2/conf input

Run some of the examples provided:

$ cd hadoop-0.20.2
$ hadoop jar hadoop-0.20.2-examples.jar grep input output 'dfs[a-z.]+'

Examine the output files:

Copy the output files from the distributed filesystem to the local
filesytem and examine them:

$ hadoop dfs -get output output
$ cat output/*

or

View the output files on the distributed filesystem:

$ hadoop dfs -cat output/*

8. Shut down Hadoop cluster

$ stop-dfs.sh
$ stop-mapred.sh

or simply stop bothe DFS and MapReduce in one command:

$ stop-all.sh

References

[1] http://hadoop.apache.org/common/docs/current/quickstart.html

Updated history
Feb. 23, 2010. Format this post.
May. 1, 2010. Change the part that copy .ssh directory to all machines.
Jun. 30, 2010. Format the post.
Jul. 13, 2010. Revise the article. The jobtracker and namenode can be same master machine.
Jul. 14, 2010. Add iptables configuration commands.

Read more:

Digg del.icio.us Stumble Techorati Facebook Newsvine Reddit Twitter
Mixx LinkedIn Google Bookmark Yahoo Bookmark MySpace LiveJournal Blogger RSS feed
1 Star2 Stars3 Stars4 Stars5 Stars (No Ratings Yet)
Loading ... Loading ...

One Comment »

Leave your response!

Add your comment below, or trackback from your own site. You can also subscribe to these comments via RSS.

Be nice. Keep it clean. Stay on topic. No spam.

You can use these tags:
<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

This is a Gravatar-enabled weblog. To get your own globally-recognized-avatar, please register at Gravatar.