New BigData Charm Initiative

We (@tyler and I) have started cutting new versions of the ASF charms so we can run more recent releases of the software than are available in the Bigtop releases/upstream charms. We quickly set our sights on a few additional goals for these charms, to accommodate our environment's needs for multi-homed networking and managed storage in some of the ASF components.

The primary goal of these charms is to let a user deploy Spark workloads in a multi-homed network environment, leverage Juju storage for each software component's individual storage needs, and deploy whatever version of the upstream software they please by supplying the software tarball as a resource.

Some key features:

  • New big data development workflow component: Conda charm.
  • Usability enhancement: Spark configured for use with radosgw or AWS S3 out of the box.
  • Network space support: Zookeeper, Jupyter-Notebook, Spark.
  • Juju storage support: Zookeeper.
  • S3 support (rough): Jupyter-Notebook, Spark.

Note: We had to cut these charms in a time box, so they have quite a bit of room for improvement across the board. I basically wanted to get them out there and working, and then start iterating on them to make them more generally useful. They no doubt have some rough edges that we intend to smooth out quickly.

Below are some of the starting points. More to come in the near future.

Zookeeper

Charmstore: https://jujucharms.com/u/omnivector/zookeeper/
Github:
* layer-zookeeper - https://github.com/omnivector-solutions/layer-zookeeper
* interface-zookeeper - https://github.com/omnivector-solutions/interface-zookeeper

Spark

Charmstore: https://jujucharms.com/u/omnivector/spark/
Github:
* layer-spark - https://github.com/omnivector-solutions/layer-spark
* layer-spark-base - https://github.com/omnivector-solutions/layer-spark-base
* layer-hadoop-base - https://github.com/omnivector-solutions/layer-hadoop-base
* interface-spark - https://github.com/omnivector-solutions/interface-spark

Jupyter-Notebook + Spark

Charmstore: https://jujucharms.com/u/omnivector/jupyter-notebook/
Github:
* layer-jupyter-notebook - https://github.com/omnivector-solutions/layer-jupyter-notebook

Conda

Charmstore: https://jujucharms.com/u/omnivector/conda/
Github:
* layer-conda - https://github.com/omnivector-solutions/layer-conda
* layer-conda-api - https://github.com/omnivector-solutions/layer-conda-api

Jupyter-Notebook + Conda + Spark

Just an example of how these come together. The object storage gateway could be an AWS S3 endpoint or a Ceph object storage gateway (radosgw) endpoint. This stack is primarily used for a Spark standalone cluster use case, but because the Jupyter-Notebook charm is built with layer-spark, it is also a great way on its own to deploy Spark 2.4.x workloads to k8s from a Jupyter notebook. Here is the bundle I've been beating on.

series: bionic
applications:
  spark:
    charm: cs:~omnivector/spark
    constraints: "tags=bdx-test spaces=mgmt,access"
    num_units: 3
    options:
      object-storage-gateway: "<object-storage-endpoint-url>"
      aws-access-key: "<s3-access-key>"
      aws-secret-key: "<s3-secret-key>"
    bindings:
      "": mgmt
      spark: access
  jupyter-notebook:
    charm: cs:~omnivector/jupyter-notebook
    constraints: "tags=bdx-test spaces=mgmt,access"
    num_units: 1
    options:
      object-storage-gateway: "<object-storage-endpoint-url>"
      aws-access-key: "<s3-access-key>"
      aws-secret-key: "<s3-secret-key>"
    bindings:
      "": mgmt
      http: access
  conda:
    charm: cs:~omnivector/conda
    num_units: 0
    options:
      conda-extra-packages: "pyspark=2.4.0 numpy ipykernel pandas pip"
      conda-extra-pip-packages: "psycopg2 Cython git+https://<oauthkey>:x-oauth-basic@github.com/<my-private-org>/<my-private-repo>@master"
relations:
- - spark:juju-info
  - conda:juju-info
- - jupyter-notebook:juju-info
  - conda:juju-info
Here is what the deployment looks like once everything has settled:

Model    Controller  Cloud/Region  Version  SLA          Timestamp
spark01  pdl-maas    pdl-maas      2.5.4    unsupported  03:07:31Z

App               Version  Status  Scale  Charm             Store       Rev  OS      Notes
conda-pdlda                active      6  conda             jujucharms   13  ubuntu  
jupyter-notebook           active      1  jupyter-notebook  jujucharms   19  ubuntu  
pdl-bdx-conda00            active      6  conda             jujucharms   13  ubuntu  
spark             2.4.1    active      5  spark             jujucharms   14  ubuntu  

Unit                  Workload  Agent  Machine  Public address  Ports                                          Message
jupyter-notebook/57*  active    idle   127      10.10.11.29     8888/tcp                                       http://10.100.211.10:8888
  conda-pdlda/11      active    idle            10.10.11.29                                                    Conda Env Installed: conda-pdlda
  pdl-bdx-conda00/5   active    idle            10.10.11.29                                                    Conda Env Installed: pdl-bdx-conda00
spark/123             active    idle   128      10.10.11.35     7078/tcp,8081/tcp                              Services: worker
  conda-pdlda/6*      active    idle            10.10.11.35                                                    Conda Env Installed: conda-pdlda
  pdl-bdx-conda00/3*  active    idle            10.10.11.35                                                    Conda Env Installed: pdl-bdx-conda00
spark/124*            active    idle   129      10.10.11.31     7077/tcp,7078/tcp,8080/tcp,8081/tcp,18080/tcp  Running: master,worker,history
  conda-pdlda/10      active    idle            10.10.11.31                                                    Conda Env Installed: conda-pdlda
  pdl-bdx-conda00/2   active    idle            10.10.11.31                                                    Conda Env Installed: pdl-bdx-conda00
spark/125             active    idle   130      10.10.11.37     7078/tcp,8081/tcp                              Services: worker
  conda-pdlda/9       active    idle            10.10.11.37                                                    Conda Env Installed: conda-pdlda
  pdl-bdx-conda00/1   active    idle            10.10.11.37                                                    Conda Env Installed: pdl-bdx-conda00
spark/126             active    idle   131      10.10.11.17     7078/tcp,8081/tcp                              Services: worker
  conda-pdlda/7       active    idle            10.10.11.17                                                    Conda Env Installed: conda-pdlda
  pdl-bdx-conda00/4   active    idle            10.10.11.17                                                    Conda Env Installed: pdl-bdx-conda00
spark/127             active    idle   132      10.10.11.40     7078/tcp,8081/tcp                              Services: worker
  conda-pdlda/8       active    idle            10.10.11.40                                                    Conda Env Installed: conda-pdlda
  pdl-bdx-conda00/0   active    idle            10.10.11.40                                                    Conda Env Installed: pdl-bdx-conda00

Machine  State    DNS          Inst id     Series  AZ  Message
127      started  10.10.11.29  d3-util-03  bionic  d3  Deployed
128      started  10.10.11.35  d3-util-04  bionic  d3  Deployed
129      started  10.10.11.31  d4-util-05  bionic  d4  Deployed
130      started  10.10.11.37  d3-util-01  bionic  d3  Deployed
131      started  10.10.11.17  d4-util-06  bionic  d4  Deployed
132      started  10.10.11.40  d4-util-03  bionic  d4  Deployed

Following deployment of the above bundle, you should be able to log in to the Jupyter notebook and start running jobs that have access to your object storage via s3a. This lets you run distributed Spark/PySpark workloads in Spark standalone mode with Ceph object storage as the backend, eliminating the need for YARN, Hadoop, and/or HDFS.

A simple example.

import os
# Point PySpark at the Python interpreter from the Conda environment installed by the conda charm.
os.environ['PYSPARK_PYTHON'] = '/opt/anaconda/envs/conda/bin/python'

from pyspark.sql import SparkSession
from pyspark import SparkConf

# Connect to the standalone Spark master deployed by the spark charm.
conf = SparkConf()\
    .setAppName('spark_playground')\
    .setMaster('spark://<master-ip-address>:7077')

spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext

# Read directly from object storage via the s3a filesystem.
sc.textFile("s3a://path/to/your/datafile.txt").take(1)
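
If the object-storage-gateway, aws-access-key, and aws-secret-key options were not set on the charms, the same s3a settings can be supplied directly on the SparkConf. The sketch below assumes the stock Spark/Hadoop s3a configuration keys; the endpoint, credentials, and master address are placeholders.

import os
os.environ['PYSPARK_PYTHON'] = '/opt/anaconda/envs/conda/bin/python'

from pyspark.sql import SparkSession
from pyspark import SparkConf

# Sketch: pass the s3a endpoint and credentials explicitly rather than relying
# on the charm-rendered spark-defaults.conf. Angle-bracket values are placeholders.
conf = SparkConf()\
    .setAppName('spark_s3a_explicit')\
    .setMaster('spark://<master-ip-address>:7077')\
    .set('spark.hadoop.fs.s3a.endpoint', '<object-storage-endpoint-url>')\
    .set('spark.hadoop.fs.s3a.access.key', '<s3-access-key>')\
    .set('spark.hadoop.fs.s3a.secret.key', '<s3-secret-key>')\
    .set('spark.hadoop.fs.s3a.path.style.access', 'true')  # radosgw generally expects path-style access

spark = SparkSession.builder.config(conf=conf).getOrCreate()
spark.sparkContext.textFile("s3a://path/to/your/datafile.txt").take(1)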

Now that we have a working Zookeeper charm, our next step is to circle back and put more cycles into the Spark charm: decouple the node types, and add a relation to Zookeeper to get Spark HA master functionality plus the shuffle service and shuffle storage working.
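
For reference, the upstream mechanism we intend to build on here is stock Spark standalone HA, where each master points at the Zookeeper ensemble and clients list every master so the driver can fail over to the elected leader. A rough, hypothetical sketch of the client side (hostnames are placeholders, and none of this is wired into the charms yet):

from pyspark import SparkConf

# On the master side the charms would need to render settings along the lines of:
#   spark.deploy.recoveryMode=ZOOKEEPER
#   spark.deploy.zookeeper.url=<zk-1>:2181,<zk-2>:2181,<zk-3>:2181
#   spark.deploy.zookeeper.dir=/spark
#
# A client then lists all masters and follows whichever is currently the leader.
conf = SparkConf()\
    .setAppName('ha_client_example')\
    .setMaster('spark://<master-1>:7077,<master-2>:7077')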

Insights, comments, pull requests welcome!
