[Tutorial] Spark Workloads Using Juju - Part 2 (Spark Singleton Extended Bundle)


#1

A Spark deployment is generally more than just a single node. The following bundle will help get you started in the right direction toward a very usable standalone Spark cluster.

Spark Standalone Bundle Core Components

The spark standalone bundle deploy consists of a few different components.

The core components:

Component         Charmstore
spark             cs:~omnivector/spark
jupyter-notebook  cs:~omnivector/jupyter-notebook
conda (pyspark)   cs:~omnivector/conda

The peripheral/monitoring components:

Component    Charmstore
telegraf     cs:telegraf
grafana      cs:grafana
prometheus2  cs:prometheus2

Deploy the Bundle

To deploy this bundle, put the following YAML into a file and deploy it with juju deploy bundle.yaml.

bundle.yaml

series: bionic
applications:
  grafana:
    charm: cs:grafana-23
    num_units: 1
    expose: true
    constraints: root-disk=20480 instance-type=t3.large 
  jupyter-notebook:
    charm: cs:~omnivector/jupyter-notebook
    num_units: 1
    expose: true
    constraints: root-disk=51200 instance-type=t3.xlarge 
  prometheus2:
    charm: cs:prometheus2
    num_units: 1
    constraints: root-disk=51200 instance-type=t3.large 
  pyspark:
    charm: cs:~omnivector/conda
    num_units: 0
    options:
      conda-extra-packages: pyspark=2.4.0 numpy ipykernel pandas pip
  spark:
    charm: cs:~omnivector/spark
    num_units: 10
    expose: true
    constraints: root-disk=51200 instance-type=t3.2xlarge
    storage:
      spark-work: ebs-ssd,100G
      spark-local: ebs-ssd,100G
  telegraf:
    charm: cs:telegraf-27
    num_units: 0
relations:
- - pyspark:juju-info
  - spark:juju-info
- - pyspark:juju-info
  - jupyter-notebook:juju-info
- - telegraf:juju-info
  - spark:juju-info
- - grafana:grafana-source
  - prometheus2:grafana-source
- - telegraf:prometheus-client
  - prometheus2:target
- - telegraf:juju-info
  - jupyter-notebook:juju-info

Result

Model       Controller  Cloud/Region   Version  SLA          Timestamp
spark-0000  pdl-aws     aws/us-west-2  2.5.4    unsupported  00:43:35Z

App               Version  Status  Scale  Charm             Store       Rev  OS      Notes
grafana                    active      1  grafana           jujucharms   23  ubuntu  exposed
jupyter-notebook           active      1  jupyter-notebook  jujucharms   32  ubuntu  exposed
prometheus2                active      1  prometheus2       jujucharms    8  ubuntu  
pyspark                    active     11  conda             jujucharms   20  ubuntu  
spark             2.4.1    active     10  spark             jujucharms   35  ubuntu  exposed
telegraf                   active     11  telegraf          jujucharms   27  ubuntu  

Unit                 Workload  Agent  Machine  Public address  Ports                                          Message
grafana/0*           active    idle   28       172.31.104.241  3000/tcp                                       Started grafana-server
jupyter-notebook/0*  active    idle   26       172.31.104.248  8888/tcp                                       http://172.31.104.248:8888
  pyspark/20         active    idle            172.31.104.248                                                 Conda Env Installed: pyspark
  telegraf/20        active    idle            172.31.104.248  9103/tcp                                       Monitoring jupyter-notebook/0
prometheus2/0*       active    idle   27       172.31.104.230  9090/tcp,12321/tcp                             Ready
spark/1*             active    idle   1        172.31.102.231  7077/tcp,7078/tcp,8080/tcp,8081/tcp,18080/tcp  Running: master,worker,history
  pyspark/0*         active    idle            172.31.102.231                                                 Conda Env Installed: pyspark
  telegraf/13        active    idle            172.31.102.231  9103/tcp                                       Monitoring spark/1
spark/13             active    idle   13       172.31.104.237  7078/tcp,8081/tcp                              Services: worker
  pyspark/9          active    idle            172.31.104.237                                                 Conda Env Installed: pyspark
  telegraf/10        active    idle            172.31.104.237  9103/tcp                                       Monitoring spark/13
spark/15             active    idle   15       172.31.103.234  7078/tcp,8081/tcp                              Services: worker
  pyspark/16         active    idle            172.31.103.234                                                 Conda Env Installed: pyspark
  telegraf/9         active    idle            172.31.103.234  9103/tcp                                       Monitoring spark/15
spark/16             active    idle   16       172.31.103.114  7078/tcp,8081/tcp                              Services: worker
  pyspark/19         active    idle            172.31.103.114                                                 Conda Env Installed: pyspark
  telegraf/12        active    idle            172.31.103.114  9103/tcp                                       Monitoring spark/16
spark/17             active    idle   17       172.31.102.13   7078/tcp,8081/tcp                              Services: worker
  pyspark/11         active    idle            172.31.102.13                                                  Conda Env Installed: pyspark
  telegraf/0*        active    idle            172.31.102.13   9103/tcp                                       Monitoring spark/17
spark/18             active    idle   18       172.31.102.155  7078/tcp,8081/tcp                              Services: worker
  pyspark/15         active    idle            172.31.102.155                                                 Conda Env Installed: pyspark
  telegraf/5         active    idle            172.31.102.155  9103/tcp                                       Monitoring spark/18
spark/20             active    idle   20       172.31.102.204  7078/tcp,8081/tcp                              Services: worker
  pyspark/12         active    idle            172.31.102.204                                                 Conda Env Installed: pyspark
  telegraf/11        active    idle            172.31.102.204  9103/tcp                                       Monitoring spark/20
spark/21             active    idle   21       172.31.104.85   7078/tcp,8081/tcp                              Services: worker
  pyspark/14         active    idle            172.31.104.85                                                  Conda Env Installed: pyspark
  telegraf/15        active    idle            172.31.104.85   9103/tcp                                       Monitoring spark/21
spark/24             active    idle   24       172.31.102.238  7078/tcp,8081/tcp                              Services: worker
  pyspark/18         active    idle            172.31.102.238                                                 Conda Env Installed: pyspark
  telegraf/6         active    idle            172.31.102.238  9103/tcp                                       Monitoring spark/24

Machine  State    DNS             Inst id              Series  AZ          Message
1        started  172.31.102.231  i-0e76f0515db3ef90e  bionic  us-west-2a  running
13       started  172.31.104.237  i-0831223f2314c426c  bionic  us-west-2c  running
15       started  172.31.103.234  i-0f17d1e2c9cee938b  bionic  us-west-2b  running
16       started  172.31.103.114  i-0bb906ad5ab4cc227  bionic  us-west-2b  running
17       started  172.31.102.13   i-0e2c2cd380e2b61da  bionic  us-west-2a  running
18       started  172.31.102.155  i-008d16354d8c3e7eb  bionic  us-west-2a  running
20       started  172.31.102.204  i-0abe5d7237ef8d70a  bionic  us-west-2a  running
21       started  172.31.104.85   i-08eaefd1632223fb0  bionic  us-west-2c  running
24       started  172.31.102.238  i-0b56741810fc00e32  bionic  us-west-2a  running
26       started  172.31.104.248  i-0fb7da77bfaf82c5e  bionic  us-west-2c  running
27       started  172.31.104.230  i-0e77e7c0eb2dd5d14  bionic  us-west-2c  running
28       started  172.31.104.241  i-0ffbbc8e3d741bb0c  bionic  us-west-2c  running

Usage

Log in to the Jupyter notebook at the IP address and port shown in the juju status output. Using the example above, the jupyter-notebook can be accessed at http://172.31.104.248:8888.

Once you are logged into the jupyter-notebook ui, go ahead and fire up a new notebook in the environment created by the conda charm.


As you can see, the conda environment takes the same name as the deployed conda charm application.
In this case our application name is 'pyspark', so we should find a conda environment named 'pyspark'.
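Since the environment name tracks the application name, the interpreter path follows a predictable pattern. A small illustrative helper (not part of the charm; /opt/conda is the install prefix used in the example below):

```python
def conda_env_python(app_name, conda_root="/opt/conda"):
    # The conda charm creates an env named after the Juju application,
    # so the env's interpreter lives under <conda_root>/envs/<app_name>.
    return f"{conda_root}/envs/{app_name}/bin/python"

print(conda_env_python("pyspark"))  # -> /opt/conda/envs/pyspark/bin/python
```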

Tell the notebook environment which interpreter to use by setting PYSPARK_PYTHON to the location of the conda environment's python. In this example we would use:

os.environ['PYSPARK_PYTHON'] = '/opt/conda/envs/pyspark/bin/python'

Verification

A simple example to verify everything is working (substitute in your own spark-master ip):

import os
os.environ['PYSPARK_PYTHON'] = '/opt/conda/envs/pyspark/bin/python'

from pyspark.sql import SparkSession
from pyspark import SparkConf

import random


conf = SparkConf()\
    .setAppName('JUJU_PI_TEST')\
    .setMaster('spark://<spark-master-ip>:7077')

spark = SparkSession.builder.config(conf=conf).getOrCreate()

sc = spark.sparkContext

num_samples = 100000000

def inside(p):
    x, y = random.random(), random.random()
    return x * x + y * y < 1

count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples

print(pi)
sc.stop()
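The estimate works because a uniform random point in the unit square lands inside the quarter circle with probability π/4, so 4 · count / num_samples converges to π. The same computation can be sanity-checked locally without Spark (a sketch, seeded for repeatability):

```python
import random

def estimate_pi(num_samples, seed=42):
    # Fraction of uniform points in the unit square that fall inside the
    # quarter circle approaches pi/4; multiply by 4 to recover pi.
    rng = random.Random(seed)
    count = sum(
        1
        for _ in range(num_samples)
        if rng.random() ** 2 + rng.random() ** 2 < 1
    )
    return 4 * count / num_samples

print(estimate_pi(100_000))  # roughly 3.14, within Monte Carlo error
```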

Test S3A Works

A simple cell to verify S3A is working correctly across the cluster. Run the following cell in your notebook, substituting in your own AWS credentials, S3 endpoint, Spark master IP, and S3 file location.

import os
os.environ['PYSPARK_PYTHON'] = '/opt/conda/envs/pyspark/bin/python'

from pyspark.sql import SparkSession
from pyspark import SparkConf


conf = SparkConf()\
    .setAppName('JUJU_S3A_TEST')\
    .setMaster('spark://<spark-master-ip>:7077')\
    .set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.1.2')

spark = SparkSession.builder.config(conf=conf).getOrCreate()

sc = spark.sparkContext
sc._jsc.hadoopConfiguration().set("fs.s3a.connection.ssl.enabled", "true")
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "<aws-s3-endpoint>")
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "<aws-access-key-id>")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "<aws-secret-access-key>")

sc.textFile("s3a://a/file/in/my/s3bucket.json").take(1)

Monitoring

Log in to the Grafana monitoring dashboard by retrieving the IP address of the grafana unit from juju status and opening the Grafana web UI.

Access the grafana web ui at: http://172.31.104.241:3000

user: admin
password: retrieve with the following command:

juju run-action grafana/0 get-admin-password --wait --format json | jq -r '.[]["results"]["password"]'
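If jq is not available, the same extraction can be done in Python. The sample below sketches the shape of the action output as implied by the jq filter (the unit key and password value are illustrative, not real output):

```python
import json

# Illustrative shape of the get-admin-password action output, inferred
# from the jq filter '.[]["results"]["password"]'.
sample = '{"unit-grafana-0": {"results": {"password": "s3cret"}}}'

def admin_password(action_json):
    # Equivalent of: jq -r '.[]["results"]["password"]'
    doc = json.loads(action_json)
    (entry,) = doc.values()  # the single unit's action result
    return entry["results"]["password"]

print(admin_password(sample))  # -> s3cret
```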

#2

Nice!

  • Perhaps fewer than 20 spark units?
  • A few words or links about the aux charms?

#3

Thanks! The number of units is arbitrary; the point of this is to show that it can be far greater than just a singleton! I do like where you are going with more realistic/sensible defaults for the common man, though. I changed it to 10 units in the markdown above. Good insight.