In the previous series of posts, I wrote about how to install the complete Hadoop stack on Windows 11 using WSL 2. Now that the new MacBook Pro laptops are available with the brand new M1 Pro and M1 Max SoCs, here’s a guide on how to install the same Hadoop stack on these laptops. Because both M1 Pro and M1 Max use the same architecture, the steps you need to follow to install Hadoop are the same. So it doesn’t matter which MacBook you got; the steps given here should work for you. So, let’s get started.
There are two important dependencies that you’ll need to install to make Hadoop work: Java and SSH. These aren’t optional, so unless you have them installed already, make sure you set them up first.
To begin with, let’s get JDK 8 installed, because Hadoop is largely dependent on Java. There are two ways of installing a JDK on an M1 Mac: using Homebrew, or directly from a vendor. We’re going to install Azul’s OpenJDK build, which is super easy to install and is also a certified JDK. You can download the JDK from here. As you can see, there are multiple versions available, but let’s stick to 8 for now. Also, make sure you download the ARM 64-bit version of the JDK.
Once you download the installer, the installation itself is pretty easy. Just open the installer and follow the steps in the wizard. It shouldn’t take more than a couple of minutes. Once you are done, make sure to export the path to the Java home directory, as this will be used by not just Hadoop, but a lot of other packages as well. For this, get the installation path (which should be very similar to the one given below), and add this to your .zshrc file:
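A minimal sketch of such an export, assuming the Azul JDK 8 landed in the usual macOS JDK directory. The fallback path below is an assumption, not something to copy blindly; you can list what is actually installed with `/usr/libexec/java_home -V`.

```shell
# Ask macOS where a Java 8 JDK lives; java_home ships with macOS.
if [ -x /usr/libexec/java_home ]; then
  JAVA_HOME="$(/usr/libexec/java_home -v 1.8 2>/dev/null || true)"
fi
# Assumed fallback: a typical location for an Azul JDK 8 install. Verify it on your machine.
export JAVA_HOME="${JAVA_HOME:-/Library/Java/JavaVirtualMachines/zulu-8.jdk/Contents/Home}"
```

Looking the path up instead of hard-coding it means the line keeps working if the JDK directory is named slightly differently on your machine.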
I’m pretty sure you can find the JDK in the exact same path if you installed JDK 8. Anyway, let’s move on to the next dependency.
Enable SSH to localhost
Unlike Linux or Windows, SSH is already installed on Macs. We only need to enable the feature and add our security keys so that we don’t need to provide our password every time. First, let’s enable the SSH (Remote Login) feature. For this, open up your System Preferences app and find the Sharing menu. From there, in the services list on the left, look for “Remote Login” and enable it. It is disabled by default. Below is a screenshot for reference.
After this, we have to create a security key for being able to SSH into the localhost. Run the following command to generate a key. Follow the instructions on screen to provide all the required information.
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
Once the key is generated, we have to copy that over to the authorised keys file so that we authorise this key to be used for SSH without password. This is important because Hadoop expects password-less SSH to be available and enabled. So run the following command to copy over the key:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
In some cases, SSH might not work if the authorised keys file has too much “public” access. To avoid this, run the following command to restrict permissions on the file:
chmod 0600 ~/.ssh/authorized_keys
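If you end up re-running these steps, the `cat >>` command will append the same key again. Here is a small sketch of a repeatable version; the function name `authorize_key` is made up for this example and is not part of SSH or Hadoop.

```shell
# authorize_key PUBKEY AUTH_FILE
# Appends PUBKEY to AUTH_FILE only if it is not already listed, then locks permissions.
authorize_key() {
  pub="$1"; auth="$2"
  mkdir -p "$(dirname "$auth")"
  touch "$auth"
  # -F fixed strings, -x whole-line match, -q quiet, -f read the pattern from the key file
  grep -qxFf "$pub" "$auth" || cat "$pub" >> "$auth"
  chmod 700 "$(dirname "$auth")"
  chmod 600 "$auth"
}
```

You would call it as `authorize_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys`.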
And that’s it. SSH should be working fine now. To be extra sure, let’s try it out with the following command:
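The interactive way to try it is a plain `ssh localhost`. If you would rather have a check that fails loudly instead of falling back to a password prompt, here is a sketch using standard OpenSSH options; the function name `check_passwordless_ssh` is just for illustration.

```shell
# Returns 0 only if localhost accepts the key without asking for anything.
# BatchMode=yes makes ssh fail instead of prompting for a password.
# StrictHostKeyChecking=accept-new (OpenSSH 7.6+) records localhost's host key on first use.
check_passwordless_ssh() {
  ssh -o BatchMode=yes \
      -o ConnectTimeout=5 \
      -o StrictHostKeyChecking=accept-new \
      localhost true
}
```

Run `check_passwordless_ssh && echo "password-less SSH works"` to see the result.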
If you don’t get any error for that command, it’s all working as expected. You’re now logged into another session in your terminal using SSH. So let’s log out from there and come back to our previous session. To do this, hit CTRL + D. You should see something similar to the following:
And we now have all dependencies installed and working.
The first step in installing Hadoop is to actually download it. As of this writing, the latest version of Hadoop is version 3.3.1, and you can download it from here. You will be downloading a .tar.gz file from there. To decompress it, you can just double-click the package and it’ll decompress and create a directory with all the contents. You can move this directory to wherever you want to place the Hadoop installation.
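If you prefer the terminal over double-clicking, the same thing can be done with `tar`. This is only a sketch: the helper name `extract_hadoop` is made up, and the download location in the usage line is an assumption.

```shell
# extract_hadoop ARCHIVE DEST_DIR: unpack a Hadoop .tar.gz under DEST_DIR.
extract_hadoop() {
  archive="$1"; dest="$2"
  [ -f "$archive" ] || { echo "no such archive: $archive" >&2; return 1; }
  mkdir -p "$dest"
  # -x extract, -z gunzip, -f read from the given file, -C switch to DEST_DIR first
  tar -xzf "$archive" -C "$dest"
}
```

For example, `extract_hadoop ~/Downloads/hadoop-3.3.1.tar.gz ~/programs` would give you a `~/programs/hadoop-3.3.1` directory, assuming the archive was saved to Downloads.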
Because we’re installing Hadoop on our local machine, we’re going to do a single-node deployment, which is also known as pseudo-distributed mode deployment.
Setting the environment variables
We have to set a bunch of environment variables. The best part is, you have to customize only one variable. The others are just copy-paste. Anyway, following are the variables I’m talking about:
# Hadoop
export HADOOP_HOME=/Users/sunny/programs/hadoop-3.3.1/
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
As you can see, you only have to change the value of the first environment variable, HADOOP_HOME. Set it to reflect the path where you have placed the Hadoop directory. Also, it is a good idea to place these export statements in the .zshrc file so that these variables are exported every time automatically instead of you having to do it. Once you place it in the file, make sure you source it so that it takes effect immediately:
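Sourcing is just `source ~/.zshrc`. To confirm the variables really took effect, something like the following sketch can help; `check_hadoop_home` is a hypothetical helper, not something Hadoop ships.

```shell
# Reports whether HADOOP_HOME is set and contains the hadoop launcher script.
check_hadoop_home() {
  if [ -z "${HADOOP_HOME:-}" ]; then
    echo "HADOOP_HOME is not set"; return 1
  fi
  if [ ! -x "$HADOOP_HOME/bin/hadoop" ]; then
    echo "no hadoop launcher under $HADOOP_HOME/bin"; return 1
  fi
  echo "HADOOP_HOME looks fine: $HADOOP_HOME"
}
```

Run `source ~/.zshrc` first and then `check_hadoop_home`. If it complains, the path in the first export most likely doesn’t match where you placed the directory.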
Next, we’ll have to edit a few files to change the config for various Hadoop components. Let’s start that with the file hadoop-env.sh. Run the following command to open the file in the editor:
sudo vim $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Next, find the line that is exporting the $JAVA_HOME variable and uncomment it. Here, you have to provide the same path that you did when you installed Java earlier. For me, that’s the following:
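The line should look like `export JAVA_HOME=<your JDK path>`, with the same value you put in .zshrc. If you’d rather not edit the file by hand, here is a sketch that appends the export instead; the helper name `set_java_home_in_env` is made up, and it relies on the file being a shell script where a later assignment wins over an earlier one.

```shell
# set_java_home_in_env ENV_FILE JAVA_HOME_PATH
# Appends an export line to ENV_FILE; quoting keeps paths with spaces intact.
set_java_home_in_env() {
  env_file="$1"; java_home="$2"
  printf 'export JAVA_HOME="%s"\n' "$java_home" >> "$env_file"
}
```

For example: `set_java_home_in_env "$HADOOP_HOME/etc/hadoop/hadoop-env.sh" "$JAVA_HOME"`.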
Next, we have to edit the core-site.xml file. Here we have to provide the temporary directory for Hadoop and also the default name for the Hadoop file system. Open the file in the editor using the following command:
sudo vim $HADOOP_HOME/etc/hadoop/core-site.xml
You’ll find an empty file here with a few comments and an empty configuration block. You can delete everything and replace it with the following:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/Users/sunny/hdfs/tmp/</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://127.0.0.1:9000</value>
  </property>
</configuration>
Make sure you create the temp directory that you configure here. Next, we have to edit the HDFS config file hdfs-site.xml. To do this, open the file in the editor using the following command:
sudo vim $HADOOP_HOME/etc/hadoop/hdfs-site.xml
In this configuration file, we are setting the HDFS data node directory, HDFS name node directory, and the HDFS replication factor. Here again you should get a file with an empty configuration block. Replace that with the following:
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/Users/sunny/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/Users/sunny/hdfs/datanode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
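The directories named in these config files need to exist before you start HDFS. A minimal sketch that creates all three under one base directory; `make_hdfs_dirs` is a hypothetical helper, and the base path in the usage line assumes the same layout as the values above.

```shell
# make_hdfs_dirs BASE: create the tmp, namenode and datanode directories under BASE.
make_hdfs_dirs() {
  base="$1"
  # -p creates missing parents and does not fail if a directory already exists
  mkdir -p "$base/tmp" "$base/namenode" "$base/datanode"
}
```

With the paths used in this post, that would be `make_hdfs_dirs "$HOME/hdfs"`.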
And again, make sure you create the data node and name node directories. Next, we have the MapReduce config file. To open this in the editor, run the following command:
sudo vim $HADOOP_HOME/etc/hadoop/mapred-site.xml
You can replace the configuration block with the following:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
As you can see, it’s a simple configuration which specifies the MapReduce framework name. And finally, we have the YARN configuration file, yarn-site.xml. Open the file in the editor using:
sudo vim $HADOOP_HOME/etc/hadoop/yarn-site.xml
Add the following configuration to the file:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>127.0.0.1</value>
  </property>
  <property>
    <name>yarn.acl.enable</name>
    <value>0</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
</configuration>
There’s nothing to change in this configuration. And finally, we’re done configuring Hadoop. We can now move on to formatting the name node and starting Hadoop.
Formatting the HDFS name node
It’s important to format the HDFS name node before starting the Hadoop service for the first time. This makes sure there’s no junk anywhere in the name node. And once you start using HDFS more frequently, you’ll realize that you’re formatting the name node more often than you thought you would, at least on your development machine. Anyway, to format the name node, use the following command:
hdfs namenode -format
Once you get the shutdown notification for name node, the formatting is complete.
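Since formatting wipes the HDFS metadata, it can be handy to check whether the name node directory was already formatted before doing it again. A formatted name node directory contains a `current/VERSION` file, so a sketch of such a guard could look like this; the function name is made up, and the directory in the usage line assumes the name node path configured earlier.

```shell
# namenode_is_formatted NAME_DIR: succeeds if NAME_DIR already holds HDFS metadata.
namenode_is_formatted() {
  [ -f "$1/current/VERSION" ]
}
```

Usage: `namenode_is_formatted "$HOME/hdfs/namenode" || hdfs namenode -format`.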
Starting all of Hadoop
Finally, we’re at the best part of this activity: starting and using Hadoop. Now, there are many ways of starting Hadoop depending on what components you actually want to use. For example, you can start only YARN, or HDFS along with it, etc. For this activity, we’ll just start everything. To do this, the Hadoop distribution provides a handy script. And because you have already exported a bunch of environment variables earlier, you don’t even have to search for that script; it’s already in your path. Just run the following command and wait for it to finish:
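The script in question lives in `$HADOOP_HOME/sbin` and is called `start-all.sh`. Here is a small sketch that wraps it with a friendlier error in case the PATH export didn’t take effect; the wrapper name `start_hadoop` is only for illustration.

```shell
# Start every Hadoop daemon via the script in $HADOOP_HOME/sbin, if it is on PATH.
start_hadoop() {
  if ! command -v start-all.sh >/dev/null 2>&1; then
    echo "start-all.sh not found; check that \$HADOOP_HOME/sbin is on PATH" >&2
    return 1
  fi
  start-all.sh
}
```

When you’re done for the day, the matching `stop-all.sh` script in the same directory shuts everything down again.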
This will take a few seconds, as the script just waits for the first 10 seconds without doing anything to give you an option to cancel the operation if you started it by mistake. Just hold on and you should see output similar to the following screenshot:
This tells us that all components of Hadoop are up and running. To make sure, if you want to, you can run the jps command to get a list of all the running Java processes. You should see at least the following services:
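For a single-node setup started with everything enabled, the daemons to look for are NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager. Rather than eyeballing the list, a sketch like the following prints whichever of them is missing; `missing_daemons` is a hypothetical helper that reads the jps output from standard input.

```shell
# Reads `jps` output on stdin and prints every expected daemon that is absent.
missing_daemons() {
  jps_out="$(cat)"
  for d in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    # -w matches whole words, so "NameNode" does not match "SecondaryNameNode"
    printf '%s\n' "$jps_out" | grep -qw "$d" || echo "$d"
  done
}
```

Run it as `jps | missing_daemons`. No output means all five are up.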
And that’s it. You’re now running Hadoop on your M1 MacBook Pro. To make sure, you can use the following simple HDFS command:
hdfs dfs -ls /
This command will list all files and directories at the root HDFS directory. If it’s a brand new deployment, you shouldn’t find much there. You’ll get a list similar to the one shown below:
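An empty listing doesn’t prove much, so here is a slightly stronger smoke test as a sketch: it writes a tiny file into HDFS, reads it back, and cleans up. The function name and the scratch path inside HDFS are made up for this example; the `hdfs dfs` subcommands themselves are the standard ones.

```shell
# Write a small file to HDFS, read it back, then remove it again.
hdfs_smoke_test() {
  command -v hdfs >/dev/null 2>&1 || { echo "hdfs not on PATH" >&2; return 1; }
  dir="/tmp/smoke-$$"   # hypothetical scratch directory inside HDFS
  hdfs dfs -mkdir -p "$dir" &&
    # "-put -" reads the file content from standard input
    echo "hello hdfs" | hdfs dfs -put - "$dir/hello.txt" &&
    hdfs dfs -cat "$dir/hello.txt" &&
    hdfs dfs -rm -r -skipTrash "$dir"
}
```

If the round trip works, both the name node and the data node are accepting requests.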
That’s pretty much it. We’re done!