Integrating Hadoop with CephFS



As a precursor to writing a subordinate charm to automate this, I thought it would be a good idea to outline the steps for integrating the two here so others in the community can see what I’m doing.

The steps are pretty simple, actually. You can do this with juju-deployed hadoop and ceph.

1. Building the cephfs hadoop plugin

From a dev machine, clone the cephfs hadoop plugin:

git clone https://github.com/ceph/cephfs-hadoop.git
cd cephfs-hadoop

Set the hadoop.version property in pom.xml to the version of hadoop running on the node (2.7.3 as of 2/4/2019).

Edit the maven-compiler-plugin version in pom.xml to 3.8.0 (fixes builds on Java versions newer than 8)
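If you would rather script the hadoop.version edit than do it by hand, something like the following should work. This is only a sketch: set_hadoop_version is a made-up helper name, and it assumes the property sits on a single line as <hadoop.version>X</hadoop.version> in pom.xml. The maven-compiler-plugin version is easier to change by hand.

```shell
# Hypothetical helper: rewrite the <hadoop.version> property in a pom.xml.
# Assumes the element is written on one line: <hadoop.version>X</hadoop.version>
set_hadoop_version() {
  pom="$1"
  version="$2"
  # Write to a temp file and move it back, so this works without sed -i.
  sed "s|<hadoop.version>[^<]*</hadoop.version>|<hadoop.version>${version}</hadoop.version>|" \
    "$pom" > "$pom.tmp" && mv "$pom.tmp" "$pom"
}

# Usage, from inside the cephfs-hadoop checkout:
#   set_hadoop_version pom.xml 2.7.3
```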

Build the thing:

mvn -Dmaven.test.skip=true package

Note: Maven will fail if the JAVA_HOME environment variable is not set to your JDK location.
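To catch the JAVA_HOME problem before Maven does, a small check can run first. This is a sketch, not part of the plugin's build: check_java_home is a hypothetical name, and it assumes a usable JDK is one that has an executable bin/javac under JAVA_HOME.

```shell
# Hypothetical pre-build check: fail early if JAVA_HOME does not look like a JDK.
# Assumption: a JDK is recognised by an executable $JAVA_HOME/bin/javac.
check_java_home() {
  if [ -z "${JAVA_HOME:-}" ] || [ ! -x "$JAVA_HOME/bin/javac" ]; then
    echo "JAVA_HOME is unset or does not contain bin/javac" >&2
    return 1
  fi
}

# Usage:
#   check_java_home && mvn -Dmaven.test.skip=true package
```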

The artifact should be in target/cephfs-hadoop-0.80.6.jar.

2. Adding ceph libraries to hadoop

Upload cephfs-hadoop-0.80.6.jar to /usr/lib/hadoop/lib on the hadoop node.

From the hadoop node, install libcephfs-java

sudo apt install libcephfs-java

Copy the libcephfs.jar to hadoop/lib

cp /usr/share/java/libcephfs.jar /usr/lib/hadoop/lib
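Before moving on, it can help to confirm that both jars actually landed in the lib directory. A minimal sketch, where missing_jars is a hypothetical helper that prints the name of every expected jar it cannot find:

```shell
# Hypothetical helper: print each expected jar that is missing from a directory.
# Prints nothing when every jar is present.
missing_jars() {
  dir="$1"
  shift
  for jar in "$@"; do
    [ -f "$dir/$jar" ] || echo "$jar"
  done
}

# Usage on the hadoop node:
#   missing_jars /usr/lib/hadoop/lib cephfs-hadoop-0.80.6.jar libcephfs.jar
```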

Hadoop won’t always load the libraries, so we need to add them to hadoop’s classpath. Add the following to /etc/hadoop/conf/hadoop-env.sh:

# Prepend both jars while keeping any HADOOP_CLASSPATH that is already set.
export HADOOP_CLASSPATH="/usr/lib/hadoop/lib/libcephfs.jar:/usr/lib/hadoop/lib/cephfs-hadoop-0.80.6.jar${HADOOP_CLASSPATH:+:$HADOOP_CLASSPATH}"
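To check that the jars really end up on the classpath, you can compare them against the output of the hadoop classpath command. A sketch under assumptions: classpath_has is a hypothetical helper, and it only matches literal entries, so a jar that hadoop picks up through a wildcard entry such as /usr/lib/hadoop/lib/* would not be reported as present.

```shell
# Hypothetical helper: succeed if a colon-separated classpath contains an exact entry.
# Wildcard entries (dir/*) are not expanded; only literal matches count.
classpath_has() {
  case ":$1:" in
    *":$2:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage on the hadoop node:
#   classpath_has "$(hadoop classpath)" /usr/lib/hadoop/lib/libcephfs.jar \
#     || echo "libcephfs.jar is not on the classpath"
```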

3. Configure hadoop

Remove the existing default filesystem property (fs.defaultFS, or its deprecated alias fs.default.name) from /etc/hadoop/conf/core-site.xml and add the following properties:

<property>
  <name>fs.ceph.impl</name>
  <value>org.apache.hadoop.fs.ceph.CephFileSystem</value>
</property>

<property>
  <name>ceph.conf.file</name>
  <value>/etc/ceph/ceph.conf</value>
</property>

<property>
  <name>fs.default.name</name>
  <value>ceph:///</value>
</property>
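A quick way to confirm that all three properties made it into core-site.xml is to grep for their names. This is a sketch only: has_property is a hypothetical helper, and it checks that the name is declared, not that the value is right.

```shell
# Hypothetical helper: succeed if the file declares a property with this name.
# It does not validate the property's value.
has_property() {
  grep -qF "<name>$2</name>" "$1"
}

# Usage on the hadoop node:
#   for p in fs.ceph.impl ceph.conf.file fs.default.name; do
#     has_property /etc/hadoop/conf/core-site.xml "$p" || echo "missing: $p"
#   done
```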

The rest of the information needed to connect to ceph (monitor addresses, fsid, etc.) should be in /etc/ceph/ceph.conf.

Example ceph.conf configuration:

[global]
auth cluster required = cephx
auth service required = cephx
auth client required = cephx

mon host = 172.31.103.152:6789 172.31.103.176:6789 172.31.103.199:6789
fsid = 050fc712-25bb-11e9-8024-02b84229d652

log to syslog = false
err to syslog = false
clog to syslog = false
mon cluster log to syslog = false
debug mon = 1/5
debug osd = 1/5

mon pg warn max object skew = -1


public network =
cluster network =
public addr = 172.31.103.92
cluster addr = 172.31.103.199


[mon]
keyring = /var/lib/ceph/mon/$cluster-$id/keyring


[mds]
keyring = /var/lib/ceph/mds/$cluster-$id/keyring
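If you want to double-check that the hadoop node's ceph.conf carries the values hadoop needs, a small lookup helper is enough. This is a sketch: conf_get is a hypothetical name, it matches the key exactly as written (so "mon host" and "mon_host" are treated as different keys), and it ignores section headers and returns the first match.

```shell
# Hypothetical helper: print the value of the first "key = value" line for a key.
# Matches the key text literally and does not distinguish between sections.
conf_get() {
  awk -F' *= *' -v key="$2" '$1 == key { print $2; exit }' "$1"
}

# Usage on the hadoop node:
#   conf_get /etc/ceph/ceph.conf fsid
#   conf_get /etc/ceph/ceph.conf "mon host"
```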

4. Verify the connection

If all is well, the following will list the contents of the root of the ceph file system:

hadoop fs -ls /
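Listing the root only proves that reads work. For a slightly stronger check, a write and read-back round trip can be sketched as below. The function name cephfs_round_trip and the remote file name are made up for illustration, and the sketch assumes the user running it is allowed to write to the root of the ceph file system; point it at another directory if that is not the case.

```shell
# Hypothetical smoke test: write a small file through hadoop, read it back, clean up.
# Assumption: the current user may create files in the root of the ceph file system.
cephfs_round_trip() {
  local_file=$(mktemp)
  remote="/cephfs-smoke-test-$$"
  result=""
  echo "hello cephfs" > "$local_file"
  hadoop fs -put "$local_file" "$remote" &&
    result=$(hadoop fs -cat "$remote") &&
    hadoop fs -rm "$remote" > /dev/null
  status=$?
  rm -f "$local_file"
  # Succeed only if every hadoop call worked and the content survived the trip.
  [ "$status" -eq 0 ] && [ "$result" = "hello cephfs" ]
}

# Usage on the hadoop node:
#   cephfs_round_trip && echo "round trip OK" || echo "round trip failed"
```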