Hadoop tutorial for beginners

What will you learn from this Hadoop tutorial for beginners?

This Hadoop tutorial covers the pre-installation environment setup for installing Hadoop on Ubuntu and details the steps for a Hadoop single-node setup, so that you can perform basic data analysis operations on HDFS and Hadoop MapReduce. The tutorial has been tested with:

Ubuntu Server 12.04.5 LTS (64-bit)
Java Version 1.7.0_101
Hadoop-1.2.1

Attractions of the Hadoop Installation Tutorial

Steps to install the prerequisite software, Java
Configuring the Linux Environment
Hadoop Configuration
Hadoop Single Node Setup-Installing Hadoop in Standalone Mode
Hadoop Single Node Setup-Installing Hadoop in Pseudo Distributed Mode
Common Errors encountered while installing hadoop on Ubuntu and how to troubleshoot them.

Apache Hadoop is supported on all flavors of Linux, so it is suggested to have a Linux OS in place before setting up the environment for the Hadoop installation. If you are using an OS other than Linux, you can install Hadoop on Ubuntu inside a virtual machine. Apache Hadoop can be installed in 3 different modes of execution:

Standalone Mode – Single node hadoop cluster setup
Pseudo Distributed Mode – Single node hadoop cluster setup
Fully Distributed Mode – Multi-node hadoop cluster setup

Hadoop Pre-installation Environment Setup

In this Hadoop tutorial, we are using Ubuntu Server 12.04.5 LTS (64-bit), which can be downloaded from the Ubuntu website.

Check for update or update the source index

Before you begin to install Hadoop on Ubuntu, ensure that the package index is up to date for all the repositories and PPAs. Execute the below command to refresh the index:

$ sudo apt-get update

Install Java

Java is the main prerequisite software for running Hadoop. To run Hadoop on Ubuntu, Java 1.6 or later, preferably from Sun/Oracle or OpenJDK, must be installed on your machine. You can check whether Java is already installed using the command java -version. This Hadoop tutorial uses OpenJDK 7, which can be installed as follows:

$ sudo apt-get install openjdk-7-jdk

You can check the java version installed on your machine by using the command – java -version
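
The check can also be scripted. The following is a minimal sketch, assuming a POSIX shell; the function name check_java is made up for this example. It prints the first line of the java -version output if a Java runtime is on the PATH, and otherwise prints the install command used in this tutorial.

```shell
# check_java is a hypothetical helper name, not part of Java or Hadoop.
check_java() {
    if command -v java >/dev/null 2>&1; then
        # java -version writes to stderr, so merge it into stdout
        java -version 2>&1 | head -n 1
    else
        echo "java not found; install it with: sudo apt-get install openjdk-7-jdk"
    fi
}

check_java
```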

Install SSH

SSH is required to manage the remote machines and your local machine when running Hadoop. In this Hadoop tutorial, we will use the OpenSSH server, which can be installed as follows:

$ sudo apt-get install openssh-server

Adding a dedicated Hadoop User Account

Creating a dedicated Hadoop user keeps the Hadoop installation separate from other software and user accounts on the same machine. We can begin by creating a “hadoop” group as follows:

$ sudo addgroup hadoop

The next step is to create a Hadoop user named hduser and add it to the “hadoop” group created in the above step.

$ sudo adduser --ingroup hadoop hduser

The command prompts you for a password and other details for the new user.
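
To confirm that the account and group were created as intended, the membership can be checked from the shell. This is a small sketch rather than part of the original steps; user_in_group is a hypothetical helper name, and it relies only on the standard id, tr and grep commands.

```shell
# user_in_group is a hypothetical helper name used only in this sketch.
user_in_group() {
    # id -nG lists the names of all groups the user belongs to
    id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qxF "$2"
}

if user_in_group hduser hadoop; then
    echo "hduser is a member of the hadoop group"
else
    echo "hduser is missing or not in the hadoop group"
fi
```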

Grant All Permissions to the Hadoop User “hduser” Created in the Above Step

To grant all permissions to the created Hadoop user, you must configure the sudoers file located at /etc/sudoers. This file should not be edited directly; use the visudo command instead, as follows:

$ sudo visudo

If visudo opens the file in the vi editor, move to the last line and type “o” to open a new line below it, then enter the following line to grant all permissions to the Hadoop user “hduser”:

hduser ALL=(ALL) ALL

Hit Escape to exit the insert mode of the editor and type “:x” to save the changes and exit from the file.

Configuring SSH Access

SSH setup is needed to perform various operations on the Hadoop cluster, so that the master node can log in to the slave nodes to start or stop them. SSH must also be set up for the secondary NameNode listed in the masters file, so that it can be started from the NameNode using the command ./start-dfs.sh, and for the JobTracker node, which is started with ./start-mapred.sh.

To configure SSH access, log in as the Hadoop user hduser:

$ su - hduser

The next step is to generate the SSH key for the Hadoop user using the following command:

$ ssh-keygen -t rsa -P ""

The command ssh-keygen -t rsa -P "" creates an RSA key pair with an empty passphrase. An empty passphrase is generally not recommended, but it lets Hadoop interact with the nodes without prompting you for the passphrase every time.

The next step is to enable SSH access with the key generated in the previous step –

$ cat /home/hduser/.ssh/id_rsa.pub >> /home/hduser/.ssh/authorized_keys
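
Running the append command more than once adds the same key to authorized_keys repeatedly. The sketch below shows one way to make the step repeatable; add_authorized_key is a hypothetical helper name, and the two file paths are passed as arguments so nothing is assumed about where the keys live.

```shell
# add_authorized_key is a hypothetical helper: it appends a public key to an
# authorized_keys file only if that exact line is not already present.
add_authorized_key() {
    pub_key_file=$1
    auth_file=$2
    mkdir -p "$(dirname "$auth_file")"
    touch "$auth_file"
    # -x: match whole lines, -F: fixed strings, -f: read the pattern from a file
    if ! grep -qxF -f "$pub_key_file" "$auth_file"; then
        cat "$pub_key_file" >> "$auth_file"
    fi
    # sshd ignores key files that other users can write to
    chmod 600 "$auth_file"
}
```

For the setup in this tutorial it would be called as add_authorized_key /home/hduser/.ssh/id_rsa.pub /home/hduser/.ssh/authorized_keys, after which ssh localhost should log in without asking for a password.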

Disabling IPv6

Hadoop does not support IPv6 and has been tested only on IPv4 networks, so it is suggested to disable IPv6. If your machine is not using IPv6, you can skip this step of the Hadoop installation process.

To disable IPv6, open the file /etc/sysctl.conf using the vi editor as shown below:

$ sudo vi /etc/sysctl.conf

Add the below lines to the file to disable IPv6:

#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

To check whether IPv6 has been disabled, run the following command. An output of 1 means IPv6 is disabled; the value may still be 0 until the new settings are loaded by the restart described next.

$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

Having edited the settings, it is suggested that you restart the machine using the below command:

$ sudo reboot
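
Once the machine is back up, the check can be wrapped in a small function that reports the state in words. This is only a sketch: ipv6_status is a hypothetical helper name, and the optional argument exists so the function can be pointed at a different flag file.

```shell
# ipv6_status is a hypothetical helper: it reads the kernel flag file and
# prints "disabled", "enabled", or "unknown" if the file cannot be read.
ipv6_status() {
    flag_file=${1:-/proc/sys/net/ipv6/conf/all/disable_ipv6}
    if [ ! -r "$flag_file" ]; then
        echo "unknown"
    elif [ "$(cat "$flag_file")" = "1" ]; then
        echo "disabled"
    else
        echo "enabled"
    fi
}

ipv6_status
```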

Hurray, you have completed the environment setup to install hadoop. Now, let’s get started with Hadoop installation in standalone mode.

Hadoop Single Node Setup- Standalone Mode

In standalone mode, Hadoop on a single node runs as a single Java process. This mode of execution is useful for debugging: it lets you run your MapReduce application on a small dataset before you run it on a Hadoop cluster with big data.

Download Hadoop

Hadoop can be downloaded using the “wget” command as shown below –

$ wget https://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz

The compressed Hadoop archive needs to be extracted as follows:

$ tar -xvzf /home/hduser/hadoop-1.2.1.tar.gz

Verify if Hadoop is installed

To confirm that hadoop has been installed, you can run the following command –

$ ls /home/hduser/

Listing the contents of /home/hduser/ should show the extracted hadoop-1.2.1 directory. You can now move the directory to the location of your choice. Let's move the hadoop directory to /usr/local/:

$ sudo mv /home/hduser/hadoop-1.2.1 /usr/local/hadoop

Assign Ownership of Hadoop to hduser

Ensure that you change the ownership of all files to “hduser” and the “hadoop” group.

$ sudo chown -R hduser:hadoop /usr/local/hadoop

Hadoop Configuration

Before hadoop is up and running, you need to configure the hadoop environment.

Update .bashrc. Add the Hadoop environment variables HADOOP_HOME and PATH to the .bashrc file of hduser. Open the file using the vi editor:

$ sudo vi /home/hduser/.bashrc

Enter the following lines at the end of the file:

#Hadoop Environment Variables
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin

Enter the following command to restart the shell so that the updated .bashrc takes effect:

$ exec bash
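
To confirm that the new shell picked up the change, you can test whether the Hadoop bin directory is now an entry in PATH. The sketch below is one way to do that; path_contains is a hypothetical helper name, and HADOOP_HOME falls back to the /usr/local/hadoop location used in this tutorial when it is not set.

```shell
HADOOP_HOME=${HADOOP_HOME:-/usr/local/hadoop}

# path_contains is a hypothetical helper: it succeeds if the given directory
# is one of the colon-separated entries in PATH.
path_contains() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

if path_contains "$HADOOP_HOME/bin"; then
    echo "$HADOOP_HOME/bin is on the PATH"
else
    echo "$HADOOP_HOME/bin is not on the PATH; check the lines added to .bashrc"
fi
```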

Java is a pre-requisite for hadoop to run, so you need to inform hadoop where java is installed by setting the variable JAVA_HOME in the hadoop-env.sh file.

$ sudo vi /usr/local/hadoop/conf/hadoop-env.sh

Enter the following lines at the end of the file:

#JAVA HOME variable
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64

Note: If you are installing Hadoop on a 32-bit Ubuntu Server or Ubuntu Desktop, your JAVA_HOME would be JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-i386
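
If you are unsure which directory applies to your machine, a candidate JAVA_HOME can be derived from the location of the java binary. The following is a sketch under two assumptions: java is already on the PATH, and readlink -f is available (it is part of GNU coreutils on Ubuntu). The helper name java_home_from_binary is made up for this example. The printed path may be the versioned directory that the java-1.7.0-openjdk-amd64 name links to.

```shell
# java_home_from_binary is a hypothetical helper: it strips a trailing
# /jre/bin/java or /bin/java from the resolved path of the java binary.
java_home_from_binary() {
    printf '%s\n' "$1" | sed -e 's|/jre/bin/java$||' -e 's|/bin/java$||'
}

if command -v java >/dev/null 2>&1; then
    # readlink -f follows the /usr/bin/java symlinks to the real location
    java_home_from_binary "$(readlink -f "$(command -v java)")"
fi
```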

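Confirm the successful installation of Hadoop in standalone mode

To confirm that Hadoop works in standalone mode, run the bundled wordcount example on the README.txt file. The job writes its results to /home/hduser/Output:

$ hadoop jar /usr/local/hadoop/hadoop-examples-1.2.1.jar wordcount /usr/local/hadoop/README.txt /home/hduser/Output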

Hadoop Single Node Setup – Pseudo Distributed Mode

In pseudo-distributed mode, Hadoop is also installed on a single machine, just like in standalone mode, but every daemon runs as a separate Java process: the NameNode, DataNode, JobTracker, TaskTracker and Secondary NameNode all run on the same machine.

Create data directory for Hadoop – HDFS

Create a data folder for HDFS using mkdir. The Hadoop user has to read from and write to this directory, so make sure that hduser has the necessary permissions on it.

$ mkdir /usr/local/hadoop/hdfs

Hadoop configuration files are located in the HADOOP_HOME/conf directory; in this tutorial the path is /usr/local/hadoop/conf/.

Configuring core-site.xml

This XML file contains properties common to HDFS and MapReduce. Hadoop provides default values for these properties in the core-default.xml file. The default properties and their values can be found at the following GitHub link: https://github.com/facebookarchive/hadoop-20/blob/master/src/core/core-default.xml.

Open the core-site.xml file using the vi editor –

$ sudo vi /usr/local/hadoop/conf/core-site.xml

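Enter the following lines between the <configuration> tags, replacing 192.168.81.139 with the IP address or hostname of your own machine:

<property>
  <name>fs.default.name</name>
  <value>hdfs://192.168.81.139:10001</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/usr/local/hadoop/hdfs</value>
</property>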

Parameter	Value	Notes
fs.default.name	URI of NameNode.	hdfs://hostname:port
hadoop.tmp.dir	Path to hdfs	used as the base for temporary directories locally
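
For reference, the complete file can also be generated from the shell with a here-document instead of being typed into vi. This is a sketch only: CONF_DIR, NAMENODE_URI and HADOOP_TMP_DIR are variable names invented for the example, and CONF_DIR defaults to a temporary directory so that nothing under /usr/local/hadoop/conf is overwritten by accident. The property names and values are the ones given in this tutorial.

```shell
# CONF_DIR is an assumption of this sketch; the tutorial's real location
# is /usr/local/hadoop/conf.
CONF_DIR=${CONF_DIR:-$(mktemp -d)}
NAMENODE_URI=${NAMENODE_URI:-hdfs://192.168.81.139:10001}
HADOOP_TMP_DIR=${HADOOP_TMP_DIR:-/usr/local/hadoop/hdfs}

# Write a full core-site.xml, including the <configuration> wrapper.
cat > "$CONF_DIR/core-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>$NAMENODE_URI</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>$HADOOP_TMP_DIR</value>
  </property>
</configuration>
EOF

echo "wrote $CONF_DIR/core-site.xml"
```

Setting CONF_DIR=/usr/local/hadoop/conf before running the snippet would write the file into the location this tutorial uses.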

Configuring mapred-site.xml

This file contains MapReduce override properties. The default properties and their values can be found here: https://github.com/facebookarchive/hadoop-20/blob/master/src/mapred/mapred-default.xml.

Open the mapred-site.xml using the vi editor –

$ sudo vi /usr/local/hadoop/conf/mapred-site.xml

Enter the below lines between the <configuration> tags:

<property>
  <name>mapred.job.tracker</name>
  <value>hdfs://192.168.81.139:10002</value>
</property>

Parameter	Value	Notes
mapred.job.tracker	Host or IP and port of JobTracker.	hdfs://host:port pair.

Configuring the masters

The masters file contains the IP addresses or hostnames of all secondary NameNodes (checkpoint servers), one per line. Execute the following command to edit masters:

$ sudo vi /usr/local/hadoop/conf/masters

Enter your checkpoint server IP, or the system IP for pseudo-distributed mode:

192.168.81.139

Configuring the Slaves

The slaves file contains the IP addresses or hostnames of all DataNodes, one per line. Execute the following command to edit slaves:

$ sudo vi /usr/local/hadoop/conf/slaves

Enter your DataNode IP, or the system IP for pseudo-distributed mode:

192.168.81.139

Configuring hdfs-site.xml

This file contains HDFS override properties. The default properties and their values for hdfs-site.xml can be found here: https://github.com/facebookarchive/hadoop-20/blob/master/src/hdfs/hdfs-default.xml.

Execute the following command to edit hdfs-site.xml:

$ sudo vi /usr/local/hadoop/conf/hdfs-site.xml

Enter the following lines between the <configuration> tags:

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

Parameter	Value	Notes
dfs.replication	Positive integer	Number of replicas kept for each block

The replication factor should not exceed the number of DataNodes:

Replication factor ≤ No. of DataNodes (how far below depends on the probability of failure of nodes)
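
The rule above can be sketched as a small calculation over the slaves file. This is an illustration rather than anything Hadoop itself runs: replication_for is a hypothetical helper name, it treats every non-empty line of the slaves file as one DataNode, and it caps a desired factor (3, Hadoop's default for dfs.replication) at that count.

```shell
# replication_for is a hypothetical helper: it counts the non-empty lines of a
# slaves file and caps the desired replication factor at that count.
replication_for() {
    slaves_file=$1
    desired=${2:-3}   # 3 is Hadoop's default dfs.replication
    # grep -c exits non-zero when the count is 0, hence the "|| true"
    datanodes=$(grep -c '[^[:space:]]' "$slaves_file" || true)
    if [ "$datanodes" -lt "$desired" ]; then
        echo "$datanodes"
    else
        echo "$desired"
    fi
}

# Demo with a temporary slaves file holding the single node of this tutorial
slaves=$(mktemp)
printf '192.168.81.139\n' > "$slaves"
replication_for "$slaves"   # prints 1, matching dfs.replication above
```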

Formatting the NameNode

The first step in getting Hadoop up and running is to format the Hadoop distributed file system (HDFS) of your cluster. The NameNode should be formatted only when the Hadoop cluster is set up for the first time. If you format an HDFS that is already in use, you will lose all the data stored in it.

Execute the following command to format the NameNode:

$ hadoop namenode -format
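
Because formatting erases existing data, it can help to check first whether the NameNode has been formatted before. The sketch below assumes that dfs.name.dir is left at its default of dfs/name under hadoop.tmp.dir, where a formatted NameNode keeps a current/VERSION file; namenode_is_formatted is a hypothetical helper name, and HADOOP_TMP_DIR defaults to the directory configured in this tutorial.

```shell
# namenode_is_formatted is a hypothetical helper: it succeeds if NameNode
# metadata exists under the given hadoop.tmp.dir (default dfs.name.dir layout).
namenode_is_formatted() {
    [ -f "$1/dfs/name/current/VERSION" ]
}

HADOOP_TMP_DIR=${HADOOP_TMP_DIR:-/usr/local/hadoop/hdfs}

if namenode_is_formatted "$HADOOP_TMP_DIR"; then
    echo "NameNode metadata found under $HADOOP_TMP_DIR; formatting would erase it"
else
    echo "no NameNode metadata found; run: hadoop namenode -format"
fi
```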

Starting the Hadoop Cluster

Start the Hadoop server by executing the following command:

$ start-all.sh

To verify if all the services are up and running, execute the below command-

$ jps
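
In pseudo-distributed mode the jps listing should include the five daemons named earlier: NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker. The following sketch reads jps output and prints any of those names that are absent; missing_daemons is a hypothetical helper name.

```shell
# missing_daemons is a hypothetical helper: it reads jps output on stdin and
# prints the name of every expected Hadoop daemon that is not listed.
missing_daemons() {
    jps_output=$(cat)
    for daemon in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
        # -w stops NameNode from matching inside SecondaryNameNode
        printf '%s\n' "$jps_output" | grep -qw "$daemon" || echo "$daemon"
    done
}

if command -v jps >/dev/null 2>&1; then
    jps | missing_daemons
fi
```

No output means all five daemons are running. A name that is printed points to the daemon whose log file is worth checking.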

Stop the hadoop servers

Stop the Hadoop server by executing the following command:

$ stop-all.sh

Troubleshooting

Apache Hadoop writes its logs to the $HADOOP_HOME/logs directory, so whenever you face an issue while installing Hadoop on Ubuntu, look at the log files there first.
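
A quick way to scan those files is to search them for error entries. The sketch below assumes that log lines carry the level as a separate word, as in "2016-01-01 10:00:01,000 ERROR ...", and that the files end in .log; recent_errors is a hypothetical helper name, and grep -H is assumed to be available as it is on Ubuntu.

```shell
# recent_errors is a hypothetical helper: for each .log file in the given
# directory it prints the last five lines logged at ERROR or FATAL level.
recent_errors() {
    log_dir=$1
    for f in "$log_dir"/*.log; do
        [ -f "$f" ] || continue
        # grep -H prefixes every match with the name of the file
        grep -H -E ' (ERROR|FATAL) ' "$f" | tail -n 5 || true
    done
}

recent_errors "${HADOOP_HOME:-/usr/local/hadoop}/logs"
```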

Common Errors Encountered during Hadoop Installation

Error: JAVA_HOME is not set.

Solution to the “JAVA_HOME is not set” error:

Open the hadoop-env.sh file located in HADOOP_HOME/conf/ using the vi editor and set the Java home path:

$ sudo vi $HADOOP_HOME/conf/hadoop-env.sh

Just to ensure that the error is resolved, try starting the hadoop server again using the hadoop command start-all.sh.

Error: hadoop: command not found

To resolve this issue, open the .bashrc file using the vi editor and make sure that HADOOP_HOME/bin is added to the PATH:

$ sudo vi /home/hduser/.bashrc

Execute the below command to reload the .bashrc:

$ exec bash

Error: ls cannot access .: No such file or directory.

When you do not specify a path after the ls command, Hadoop uses the default path /user/<username> in HDFS, which does not exist yet. To resolve this issue, create the folder under /user/; in this tutorial it is /user/hduser.

$ hadoop fs -mkdir /user/hduser

Error: Some index files failed to download or old index file used.

This error can be resolved by removing the old index files and then refreshing the package index again:

$ sudo rm -r /var/lib/apt/lists/*
$ sudo apt-get update

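If you encounter an “Incompatible namespaceIDs” exception, troubleshoot it as follows:

Stop all the services.
Delete the DataNode data directory, dfs/data under hadoop.tmp.dir (for example /tmp/hadoop/dfs/data; with the core-site.xml used in this tutorial it is /usr/local/hadoop/hdfs/dfs/data). Note that this removes the blocks stored on that DataNode.
Start all the services again.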
