The data analytics benchmark uses the Hadoop MapReduce framework to perform machine learning analysis on large-scale datasets. Apache provides a machine learning library, Mahout, that is designed to run on top of Hadoop and perform large-scale data analytics.
Download the Data Analytics Benchmark.
Prerequisite Software Packages
- Hadoop binaries: we recommend version 0.20.2, which was used and tested in our environment.
- Mahout source files: Mahout is a scalable machine learning and data mining library. Version 0.6 is recommended.
- Maven: a project management tool, version 3.0 or later (required for installing Mahout)
- Java JDK 1.6 or later
The preconfigured prerequisite software packages are already included in our distribution, but they can be also downloaded separately from the Apache website.
Note: some of the configuration steps below might not be necessary if you use the preconfigured packages from our archive.
Setting up Hadoop
Hadoop contains two main components: the HDFS file system and the MapReduce infrastructure. HDFS is the distributed file system that stores the data on several nodes of the cluster called the data nodes. MapReduce is the component in charge of executing computational tasks on top of the data stored in the distributed file system.
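For example, once both components are running, input data is first copied from the local file system into HDFS, and MapReduce jobs then read it from there (the paths below are only an illustration):
hadoop fs -put /local/path/to/data input/data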
Installing Hadoop on a single node:
- It is recommended to create a Hadoop user (Note: this requires root privileges, and the commands may differ across Linux distributions, e.g., useradd vs. adduser):
- sudo groupadd hadoop
- sudo useradd -g hadoop hadoop
- sudo passwd hadoop (to set up the password)
- Preparing SSH:
- su - hadoop
- ssh-keygen -t rsa -P "" (press enter for any prompts)
- cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
- ssh localhost (answer yes to the prompt)
- Unpack the downloaded Hadoop archive. Then, chown -R hadoop:hadoop hadoop-0.20.2
- Update the ~/.bashrc file (or an equivalent configuration file if you use other shells) with the following:
- export HADOOP_HOME=/path/to/hadoop (e.g., /home/username/hadoop-0.20.2/)
- export JAVA_HOME=/path/to/jdk (e.g., /usr/lib/jvm/java-6-sun)
- Edit the conf/hadoop-env.sh configuration file and specify the Java JDK location (JAVA_HOME)
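For example, assuming the JDK location shown above, the corresponding line in conf/hadoop-env.sh would read:
export JAVA_HOME=/usr/lib/jvm/java-6-sun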
- In the HADOOP_HOME folder (the main Hadoop folder) make the following modifications to the configuration files:
- In conf/core-site.xml add:
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
  <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>
- In conf/mapred-site.xml add:
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
  <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>
Note that there are other parameters that should be tuned to fit the needs of your environment, such as mapred.tasktracker.map.tasks.maximum, mapred.tasktracker.reduce.tasks.maximum, mapred.map.tasks, mapred.reduce.tasks. For now we will rely on the default values.
- In conf/hdfs-site.xml add:
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
</property>
<property>
  <name>dfs.name.dir</name>
  <value>/path/to/namespace/dir (a folder of your choice with read/write permissions)</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/path/to/filesystem/dir (a folder of your choice with read/write permissions)</value>
</property>
- In conf/core-site.xml also add:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/path/to/tmpdir (a folder of your choice with read/write permissions)</value>
</property>
- Format the HDFS filesystem:
$HADOOP_HOME/bin/hadoop namenode -format
- To start Hadoop:
$HADOOP_HOME/bin/start-all.sh (alternatively, you can start HDFS and then MapReduce with start-dfs.sh and start-mapred.sh, respectively)
- To stop Hadoop:
$HADOOP_HOME/bin/stop-all.sh (or stop-dfs.sh and stop-mapred.sh)
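To verify that the daemons started successfully, you can use the JDK's jps tool; on a single node, after start-all.sh you should see the NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker processes listed:
jps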
For more information about Hadoop installation, this link provides a step-by-step guide to installing Hadoop on a single node.
To install Hadoop on multiple nodes, you need to define the slave nodes in conf/slaves and keep the node on which you already installed Hadoop as the master. Then, start all the nodes with start-all.sh from the master node. This link provides more details on installing Hadoop on multiple nodes.
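As an illustration, with a hypothetical master host named master and two slave hosts named slave1 and slave2, the configuration files on the master node would contain:
conf/masters:
master
conf/slaves:
slave1
slave2
(In Hadoop 0.20.2, conf/masters lists the host that runs the secondary namenode, while conf/slaves lists the hosts that run the datanode and tasktracker daemons.)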
Setting up Mahout
- Download Mahout (already included in the analytics benchmark package)
- For installing Mahout:
mvn install -DskipTests (from the $MAHOUT_HOME folder)
This will build the core package and the Mahout examples. You need to have Maven installed to be able to perform this step.
- For further information regarding the Mahout installation, please refer to the Mahout website.
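The commands in the next section assume that the MAHOUT_HOME environment variable points to the unpacked Mahout directory; if it is not set, export it the same way as HADOOP_HOME, e.g. (the path is an example):
export MAHOUT_HOME=/path/to/mahout-distribution-0.6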
Running the data analytics benchmark
This benchmark runs a modified Wikipedia Bayes classification example provided by the Mahout package:
- This benchmark uses a Wikipedia data set of ~30GB. You can download it from the following address.
- Unzip the bz2 file to get enwiki-latest-pages-articles.xml.
- Create the folder $MAHOUT_HOME/examples/temp and copy enwiki-latest-pages-articles.xml into this folder.
- To train the classifier, the benchmark uses another, smaller data set. You can download it from the following address.
- Unzip the bz2 file to get enwiki-20100904-pages-articles1.xml and copy the file into $MAHOUT_HOME/examples/temp.
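As an illustration, assuming both archives were downloaded to the current directory, the two staging steps above can be performed as follows:
bunzip2 enwiki-latest-pages-articles.xml.bz2
bunzip2 enwiki-20100904-pages-articles1.xml.bz2
mkdir -p $MAHOUT_HOME/examples/temp
cp enwiki-latest-pages-articles.xml enwiki-20100904-pages-articles1.xml $MAHOUT_HOME/examples/temp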
- The next step is to chunk both the input and the training data into 64MB pieces. For this step, run the following:
$MAHOUT_HOME/bin/mahout wikipediaXMLSplitter -d $MAHOUT_HOME/examples/temp/enwiki-latest-pages-articles.xml -o wikipedia/chunks -c 64
$MAHOUT_HOME/bin/mahout wikipediaXMLSplitter -d $MAHOUT_HOME/examples/temp/enwiki-20100904-pages-articles1.xml -o wikipedia-training/chunks -c 64
- After creating the chunks, verify that the chunks were written successfully into HDFS (look for chunk-*.xml files):
hadoop dfs -ls wikipedia/chunks
hadoop dfs -ls wikipedia-training/chunks
- Before running the benchmark, you need to create the category-based splits of the Wikipedia dataset and the training dataset:
- Locate the file named categories.txt in the benchmark archive and copy the file into $MAHOUT_HOME/examples/temp.
- $MAHOUT_HOME/bin/mahout wikipediaDataSetCreator -i wikipedia/chunks -o wikipediainput -c $MAHOUT_HOME/examples/temp/categories.txt
- $MAHOUT_HOME/bin/mahout wikipediaDataSetCreator -i wikipedia-training/chunks -o traininginput -c $MAHOUT_HOME/examples/temp/categories.txt
- Check that the wikipediainput and traininginput folders were successfully created. You should see a part-r-00000 file inside these folders:
hadoop fs -ls wikipediainput
hadoop fs -ls traininginput
- Build the classifier model. After completion, the model will be available in HDFS, under the wikipediamodel folder:
$MAHOUT_HOME/bin/mahout trainclassifier -i traininginput -o wikipediamodel -mf 4 -ms 4
- Test the classifier model:
$MAHOUT_HOME/bin/mahout testclassifier -m wikipediamodel -d wikipediainput --method mapreduce
By default, the testing phase runs as a standalone application; the --method mapreduce option is required to run the benchmark on the MapReduce framework. To list all the options available for running the benchmark, type:
$MAHOUT_HOME/bin/mahout testclassifier --help
- The following parameters in $HADOOP_HOME/conf/mapred-site.xml are necessary in the tuning process (you should add them and modify them according to your platform), as illustrated in the sketch after this list:
- mapred.map.tasks: specifies the number of map tasks per job
- mapred.reduce.tasks: specifies the number of reduce tasks per job
- mapred.child.java.opts: specifies JVM options for the child task processes, such as the maximum heap size, e.g., -Xmx1536M
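As an illustration, these entries in conf/mapred-site.xml might look as follows (the values below are placeholders and must be adapted to your platform):
<property>
  <name>mapred.map.tasks</name>
  <value>8</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1536M</value>
</property>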
- The metric is the number of pages classified per unit of time. The result of the benchmark run reports the number of pages classified and the time needed to finish, from which the metric can be derived directly.
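For example (with hypothetical numbers), if a run reports 120,000 classified pages in 600 seconds, the metric is 120,000 / 600 = 200 pages per second.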