Hadoop Tutorial
Tags: cloud computing, hadoop, java, MapReduce, programming, tutorials
This tutorial shows how to set up a Hadoop environment on a small cluster of four machines. Extending the system to a larger cluster only requires adding more slave nodes.
1. Install needed packages
Sun Java JDK:
Download from: http://java.sun.com/
In this example it is installed into /usr/java/jdk1.6.0_16; the path may differ depending on the JDK version.
Install JDK on ALL nodes of the Hadoop cluster.
Hadoop:
Download from: http://hadoop.apache.org/
Unpack it somewhere. We store it in /lhome/zma/hadoop in this example.
For convenience, make the hadoop commands callable without their full path:
$ export PATH=/lhome/zma/hadoop-0.20.2/bin/:$PATH
or
$ ln -s /lhome/zma/hadoop-0.20.2/bin/* /home/zma/bin/
2. Hosts files
Add lines like this to /etc/hosts on ALL nodes:
We use four machines here.
10.0.0.103 vm103
10.0.0.104 vm104
10.0.0.105 vm105
10.0.0.106 vm106
After rebooting, each host’s name will automatically change to the name given in the hosts file.
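For a larger cluster, the hosts entries can be generated instead of typed by hand. The following is a minimal sketch, assuming every node follows the 10.0.0.N / vmN naming used in this example; the function name gen_hosts is hypothetical.

```shell
# Print one /etc/hosts line per node for the IDs first..last.
# Assumes the 10.0.0.N <-> vmN naming convention of this tutorial.
gen_hosts() {
    local first="$1" last="$2" i
    i=$first
    while [ "$i" -le "$last" ]; do
        printf '10.0.0.%d vm%d\n' "$i" "$i"
        i=$((i + 1))
    done
}

# Print the four entries used in this tutorial.
gen_hosts 103 106
```

The output can be reviewed first and then appended to /etc/hosts as root on each node.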
Configure iptables:
If the nodes are in a secure local area network, we can configure iptables to allow all connections by running these commands on all nodes:
# iptables -F
# service iptables save
3. Password-less SSH login
More details about password-less ssh login can be found here.
Generate private and public key:
On vm103 the master node:
$ ssh-keygen -t rsa
$ chmod 755 .ssh
$ cat .ssh/id_rsa.pub >> .ssh/authorized_keys
$ chmod 755 .ssh/authorized_keys
Copy .ssh directory to all nodes’ home directories:
$ for ((i=103; i<=106; i++)); do scp -r .ssh vm$i:~/; done;
Try to ssh to all these nodes:
$ for ((i=103; i<=106; i++)); do ssh vm$i ; done;
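The loop above opens an interactive shell on each node in turn. For a quick pass/fail check, a non-interactive variant may be handier. This is a sketch only: check_ssh is a hypothetical helper, and it assumes the vmN host names from the hosts file. BatchMode=yes makes ssh fail rather than prompt for a password, so a node where the key login is broken shows up as FAILED.

```shell
# Report which nodes accept a key-based login without prompting.
# check_ssh is a hypothetical helper; node names are assumed to be vmN.
check_ssh() {
    local first="$1" last="$2" i
    i=$first
    while [ "$i" -le "$last" ]; do
        # Run the no-op command "true" remotely; only the exit status matters.
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "vm$i" true; then
            echo "vm$i ok"
        else
            echo "vm$i FAILED"
        fi
        i=$((i + 1))
    done
}
```

For the cluster in this tutorial it would be called as: check_ssh 103 106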
4. Hadoop Configurations
hadoop/conf/hadoop-env.sh
Add or change these lines depending on your configuration:
export JAVA_HOME=/usr/java/jdk1.6.0_16
export HADOOP_HOME=/lhome/zma/hadoop-0.20.2
hadoop/conf/core-site.xml
Add these lines inside the <configuration> element:
Here the namenode is vm103.
<property>
<name>fs.default.name</name>
<value>hdfs://vm103:9000</value>
</property>
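In the stock configuration files, property elements sit inside a single <configuration> root element. The sketch below writes a complete core-site.xml from the shell, which can be convenient when scripting the setup. It overwrites any existing file, and write_core_site is a hypothetical helper, not part of Hadoop.

```shell
# Write a complete core-site.xml into the given conf directory.
# write_core_site is a hypothetical helper; it overwrites an existing file.
write_core_site() {
    local conf_dir="$1" namenode="$2"
    cat > "$conf_dir/core-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://$namenode:9000</value>
  </property>
</configuration>
EOF
}
```

With the layout of this tutorial it would be called as: write_core_site /lhome/zma/hadoop/conf vm103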
hadoop/conf/hdfs-site.xml
Add these lines inside the <configuration> element:
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
hadoop/conf/mapred-site.xml
Add these lines inside the <configuration> element:
Here the jobtracker is vm103.
<property>
<name>mapred.job.tracker</name>
<value>vm103:9001</value>
</property>
hadoop/conf/masters
Delete localhost and add:
vm103
hadoop/conf/slaves
Delete localhost and add:
vm104
vm105
vm106
5. Duplicate Hadoop to all nodes
Here we are on vm103:
$ for ((i=104; i<=106; i++)); do scp -r /lhome/zma/hadoop vm$i:/lhome/zma/ ; done;
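When the node names do not form a numeric range, the copy can be driven by the slaves file instead. This is a sketch under the assumption that the slaves file holds one host name per line, as configured in step 4; copy_to_slaves is a hypothetical helper.

```shell
# Copy a directory to the same parent path on every host in a slaves file.
# copy_to_slaves is a hypothetical helper; one host name per line is assumed.
copy_to_slaves() {
    local src="$1" slaves_file="$2" host
    # The "|| [ -n ... ]" part also handles a last line without a newline.
    while read -r host || [ -n "$host" ]; do
        # Skip blank lines.
        [ -z "$host" ] && continue
        # Redirect stdin so scp cannot swallow the rest of the slaves file.
        scp -r "$src" "$host:$(dirname "$src")/" < /dev/null
    done < "$slaves_file"
}
```

With the layout of this tutorial it would be called as: copy_to_slaves /lhome/zma/hadoop /lhome/zma/hadoop/conf/slaves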
6. Start Hadoop
Starting Hadoop means starting both HDFS and MapReduce.
On vm103 (namenode):
Format a new distributed filesystem (HDFS):
$ hadoop namenode -format
Before reformatting, remember to delete HDFS’s local files on all nodes:
$ rm -rf /lhome/zma/hadoop/logs /tmp/hadoop-zma
Start HDFS:
$ start-dfs.sh
Check the HDFS status:
$ hadoop dfsadmin -report
The report may list fewer nodes than we actually have, typically because some datanodes have not finished registering yet. If so, run the command again after a moment.
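Re-running the report by hand can be replaced with a small polling loop. The sketch below assumes the report contains a line of the form "Datanodes available: N (...)", which is what 0.20.x prints as far as we know; both function names are hypothetical, and the parsing would need adjusting if the output format differs.

```shell
# Read `hadoop dfsadmin -report` output on stdin, print the live datanode count.
# Assumes a line of the form "Datanodes available: N (...)".
live_datanodes() {
    sed -n 's/^Datanodes available: \([0-9][0-9]*\).*/\1/p'
}

# Poll the report until at least `expected` datanodes are available.
# Gives up after `tries` attempts (default 10), sleeping 3 seconds between them.
wait_for_datanodes() {
    local expected="$1" tries="${2:-10}" n
    while [ "$tries" -gt 0 ]; do
        tries=$((tries - 1))
        n=$(hadoop dfsadmin -report 2>/dev/null | live_datanodes)
        if [ "${n:-0}" -ge "$expected" ]; then
            echo "$n datanodes available"
            return 0
        fi
        sleep 3
    done
    echo "only ${n:-0} datanodes available after waiting" >&2
    return 1
}
```

For the three slave nodes in this tutorial it would be called as: wait_for_datanodes 3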
Start MapReduce:
On vm103 (jobtracker):
$ start-mapred.sh
Check job status:
$ hadoop job -list
7. Enjoy the fun now
A simple example
Copy the input files into the distributed filesystem:
$ hadoop fs -put hadoop-0.20.2/conf input
Run some of the examples provided:
$ cd hadoop-0.20.2
$ hadoop jar hadoop-0.20.2-examples.jar grep input output 'dfs[a-z.]+'
Examine the output files:
Copy the output files from the distributed filesystem to the local filesystem and examine them:
$ hadoop dfs -get output output
$ cat output/*
or
View the output files on the distributed filesystem:
$ hadoop dfs -cat output/*
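To sanity-check the job's output, the same pattern can be counted locally on the original input files. The sketch below is only similar in spirit to the grep example job: it lists each distinct match of the regular expression with its number of occurrences, most frequent first. local_grep_count is a hypothetical helper.

```shell
# Count each distinct match of an extended regex in the given files,
# most frequent first. local_grep_count is a hypothetical helper.
local_grep_count() {
    local pattern="$1"
    shift
    # -o prints only the matched text, -h omits file names.
    grep -ohE "$pattern" "$@" | sort | uniq -c | sort -rn
}
```

For the input used above it would be called as: local_grep_count 'dfs[a-z.]+' hadoop-0.20.2/conf/*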
8. Shut down Hadoop cluster
$ stop-dfs.sh
$ stop-mapred.sh
or simply stop both DFS and MapReduce in one command:
$ stop-all.sh
References
[1] http://hadoop.apache.org/common/docs/current/quickstart.html
Update history
Feb. 23, 2010. Formatted this post.
May 1, 2010. Changed the part that copies the .ssh directory to all machines.
Jun. 30, 2010. Formatted the post.
Jul. 13, 2010. Revised the article. The jobtracker and namenode can be the same master machine.
Jul. 14, 2010. Added iptables configuration commands.