Integrating Hadoop with CephFS

As a precursor to writing a subordinate charm to automate this, I thought it would be a good idea to outline the steps for integrating the two here, so others in the community can see what I'm doing.

The steps are pretty simple, actually. You can do this with Juju-deployed Hadoop and Ceph.

1. Build the cephfs-hadoop plugin

From a dev machine, clone the cephfs hadoop plugin:

git clone https://github.com/ceph/cephfs-hadoop.git
cd cephfs-hadoop

Edit the hadoop.version property in pom.xml to match the version of Hadoop on the node (2.7.3 as of 2/4/2019).

Edit the maven-compiler-plugin version in pom.xml to 3.8.0 (fixes builds on Java versions newer than 8)
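
For reference, after those two edits the relevant parts of pom.xml should look roughly like the snippet below. The surrounding elements are abbreviated and the exact layout depends on the upstream pom, so treat this as a sketch rather than a patch:

<properties>
  ...
  <hadoop.version>2.7.3</hadoop.version>
</properties>

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-compiler-plugin</artifactId>
  <version>3.8.0</version>
</plugin>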

Build the thing:

mvn -Dmaven.test.skip=true package

Note: Maven will fail if the JAVA_HOME environment variable is not set to your JDK location.
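
For example, on an Ubuntu machine with OpenJDK 8 the build typically looks like this (adjust JAVA_HOME to wherever your JDK actually lives):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
mvn -Dmaven.test.skip=true package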

The artifact should be in target/cephfs-hadoop-0.80.6.jar.

2. Add the Ceph libraries to Hadoop

Upload cephfs-hadoop-0.80.6.jar to /usr/lib/hadoop/lib on the hadoop unit.
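
One way to do that is with juju scp and juju ssh. The unit name namenode/0 below is only an example, so substitute the unit names from your own deployment (and repeat for any other hadoop units that need the plugin):

juju scp target/cephfs-hadoop-0.80.6.jar namenode/0:/tmp/
juju ssh namenode/0 'sudo mv /tmp/cephfs-hadoop-0.80.6.jar /usr/lib/hadoop/lib/'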

From the hadoop node, install libcephfs-java:

sudo apt install libcephfs-java

Copy libcephfs.jar into Hadoop's lib directory:

cp /usr/share/java/libcephfs.jar /usr/lib/hadoop/lib

Hadoop won’t always load the libraries, so we need to add them to hadoop’s classpath. Add the following to /etc/hadoop/conf/hadoop-env.sh:

export HADOOP_CLASSPATH="/usr/lib/hadoop/lib/cephfs-hadoop-0.80.6.jar"
export HADOOP_CLASSPATH="/usr/lib/hadoop/lib/libcephfs.jar:$HADOOP_CLASSPATH"

3. Configure Hadoop

Remove the existing default filesystem property from /etc/hadoop/conf/core-site.xml and add the following properties:

<property>
  <name>fs.ceph.impl</name>
  <value>org.apache.hadoop.fs.ceph.CephFileSystem</value>
</property>

<property>
  <name>ceph.conf.file</name>
  <value>/etc/ceph/ceph.conf</value>
</property>

<property>
  <name>fs.default.name</name>
  <value>ceph:///</value>
</property>

The rest of the information needed to connect to Ceph (monitor addresses, fsid, etc.) should be in /etc/ceph/ceph.conf.

Example ceph.conf configuration:

[global]
auth cluster required = cephx
auth service required = cephx
auth client required = cephx

mon host = 172.31.103.152:6789 172.31.103.176:6789 172.31.103.199:6789
fsid = 050fc712-25bb-11e9-8024-02b84229d652

log to syslog = false
err to syslog = false
clog to syslog = false
mon cluster log to syslog = false
debug mon = 1/5
debug osd = 1/5

mon pg warn max object skew = -1


public network =
cluster network =
public addr = 172.31.103.92
cluster addr = 172.31.103.199


[mon]
keyring = /var/lib/ceph/mon/$cluster-$id/keyring


[mds]
keyring = /var/lib/ceph/mds/$cluster-$id/keyring
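
Before moving on, it's worth checking that the cluster is healthy and that a CephFS filesystem with an active MDS actually exists. From a machine that has a Ceph admin keyring (a ceph-mon unit, for example):

sudo ceph -s
sudo ceph fs ls
sudo ceph mds stat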

4. Verify the connection

If all is well, the following will list the contents of the root of the Ceph file system:

hadoop fs -ls /
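
For a slightly stronger smoke test, a quick write/read round trip also exercises the plugin; the /tmp/cephfs-test path here is arbitrary:

echo "hello cephfs" > /tmp/hello.txt
hadoop fs -mkdir -p /tmp/cephfs-test
hadoop fs -put /tmp/hello.txt /tmp/cephfs-test/
hadoop fs -cat /tmp/cephfs-test/hello.txt
hadoop fs -rm -r /tmp/cephfs-test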

Hi, I'm in a bit of trouble: I have 3 Hadoop nodes with 3 Ceph nodes, but when I start a MapReduce task I can't control the number of map tasks.

In short, only one Hadoop node ends up running the task.

How can I run a MapReduce task across multiple nodes?

P.S. With HDFS I can change dfs.block.size and pass -D mapreduce.job.maps=30 to change the number of map tasks, but I cannot change the map count with Ceph.

P.P.S. I am not a native English speaker, so my tone may not come across as intended, but I am very sincere. I hope to get your answer. Thank you very much.

/usr/local/hadoop-2.7.7/bin/hadoop jar /usr/local/hadoop-2.7.7/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar teragen -D mapreduce.job.maps=30 -D mapreduce.job.reduces=0 1073741824 /user/input/terasort/100G-input

Thanks for asking the question @nothand. Welcome to the Juju community! I have created a separate discussion to answer it:

p.s. Don’t worry about your English language skills. The majority of the community are not native English speakers. We all have sympathy.