Installing Hadoop on OS X El Capitan (And probably Sierra)

This information is primarily sourced (read: heavily copied and modified) from getblueshift.com, with references to stackoverflow.com.  This also assumes you have Homebrew and Java 1.7+ installed.

When I did this, it installed Hadoop 2.7.3.

Steps:
1.  Set JAVA_HOME in your bash profile.

  $ export JAVA_HOME=$(/usr/libexec/java_home)

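To have this persist across sessions, you can append the line to your bash profile and reload it (a minimal sketch; adjust the path if you keep your shell config elsewhere):

  $ echo 'export JAVA_HOME=$(/usr/libexec/java_home)' >> ~/.bash_profile
  $ source ~/.bash_profile
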
2.  Install Hadoop with brew.  As of this writing it will download and install 2.7.3

  $ brew install hadoop

3.  To make Hadoop work as a single-node cluster you have to go through several configuration steps.  Here they are in brief.

4.  Setup ssh to connect to localhost without login

  $ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
  $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

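Note: newer versions of OpenSSH (including the one shipped with Sierra) disable DSA keys by default.  If the DSA key is rejected, an RSA key set up the same way should work (an alternative, not from the original article):

  $ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
  $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
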
5. Test that you can log in.  If you are not able to, turn on Remote Login in System Preferences -> Sharing

  $ ssh localhost

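If you prefer the command line over System Preferences, macOS's systemsetup utility can check and enable Remote Login (requires an administrator password):

  $ sudo systemsetup -getremotelogin
  $ sudo systemsetup -setremotelogin on
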
6.  Brew usually installs Hadoop in /usr/local/Cellar/hadoop/

  $ cd /usr/local/Cellar/hadoop/2.7.3

Note that the version number may be different for your install.

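To avoid hard-coding the version, you can also let brew resolve the current install path (a small convenience, not from the original article); brew --prefix prints the opt path, which is a symlink to the currently installed Cellar version:

  $ cd $(brew --prefix hadoop)
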
7.  Edit the following config files in /usr/local/Cellar/hadoop/2.7.3/libexec/etc/hadoop

  $ vi hdfs-site.xml

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>

  $ vi core-site.xml

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

  $ vi mapred-site.xml

    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
    </configuration>

  $ vi yarn-site.xml

    <configuration>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
      <property>
        <name>yarn.resourcemanager.address</name>
        <value>127.0.0.1:8032</value>
      </property>
      <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>127.0.0.1:8030</value>
      </property>
      <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>127.0.0.1:8031</value>
      </property>
    </configuration>

8.  Format and start HDFS and Yarn.  See the Troubleshooting Note at the bottom.

  $ cd /usr/local/Cellar/hadoop/2.7.3
  $ ./bin/hdfs namenode -format
  $ ./sbin/start-dfs.sh
  $ ./bin/hdfs dfs -mkdir /user
  $ ./bin/hdfs dfs -mkdir /user/<username>
  $ ./sbin/start-yarn.sh

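To confirm the daemons came up, jps (which ships with the JDK) lists the running Java processes; for a healthy single-node setup you should see something like NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager:

  $ jps
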
9.  Hadoop talks to itself using two addresses: localhost and <machine-name>.  Out of the box I had communication issues.  The three address properties we added to yarn-site.xml fix one of them; the other occurs when MapReduce starts to run and tries to connect to <machine-name> (Serenity in my case) but can't resolve it.  To fix this issue we need to modify /etc/hosts

  $ sudo vi /etc/hosts

The very first line will be:
    127.0.0.1       localhost
And we need to change it to:
    127.0.0.1       localhost Serenity
This defines both “localhost” and “Serenity” as aliases for 127.0.0.1.

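A quick way to verify the alias resolves before running any jobs (substitute your own machine name for Serenity):

  $ ping -c 1 Serenity
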
10.  Test the example code that came with the Hadoop distribution (the example jar's version number matches the installed Hadoop version)

  $ ./bin/hdfs dfs -put libexec/etc/hadoop input
  $ ./bin/hadoop jar libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'
  $ ./bin/hdfs dfs -get output output
  $ cat output/*

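If you just want to inspect the result, you can also read it directly from HDFS instead of copying it back with -get:

  $ ./bin/hdfs dfs -cat output/*
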
11.  Remove tmp files

  $ ./bin/hdfs dfs -rm -r /user/<username>/input
  $ ./bin/hdfs dfs -rm -r /user/<username>/output
  $ rm -rf output/

12.  Stop HDFS and Yarn after you are done

  $ ./sbin/stop-yarn.sh
  $ ./sbin/stop-dfs.sh

13.  Add HADOOP_HOME and HADOOP_CONF_DIR to your bashrc for future use

  $ export HADOOP_HOME=/usr/local/Cellar/hadoop/2.7.3
  $ export HADOOP_CONF_DIR=$HADOOP_HOME/libexec/etc/hadoop

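Optionally, you can also put the Hadoop scripts on your PATH so hdfs, hadoop, start-dfs.sh, etc. can be run from any directory (not part of the original article):

  $ export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
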
14.  Complete!  See the original article for a note about installing Pig

Troubleshooting Note

Note: HDFS attempts to put its data folder in /home.  On one of my machines this failed due to permission issues and I had to move the HDFS data to a new location.  I chose /usr/local/share/hduser/.  If you need to move the folder location, you will need to create the directories, add two more properties to hdfs-site.xml, and add one to core-site.xml

  $ mkdir -p /usr/local/share/hduser/mydata/hdfs/namenode
  $ mkdir -p /usr/local/share/hduser/mydata/hdfs/datanode
  $ mkdir -p /usr/local/share/hduser/tmp

  $ vi hdfs-site.xml

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/share/hduser/mydata/hdfs/namenode</value>
      </property>
      <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/share/hduser/mydata/hdfs/datanode</value>
      </property>
    </configuration>

  $ vi core-site.xml

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/share/hduser/tmp</value>
      </property>
    </configuration>

If you change the folder, you will need to rerun “hdfs namenode -format” to format the new location.
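If HDFS was already running when you made these changes, the restart-and-reformat sequence looks roughly like this (note that formatting erases anything already stored in HDFS):

  $ ./sbin/stop-dfs.sh
  $ ./bin/hdfs namenode -format
  $ ./sbin/start-dfs.sh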
