[Tutorial] Spark Workloads Using Juju - Part 2 (Spark Singleton Extended Bundle)


Generally, a Spark deployment is more than just a single node. The following bundle will help get you started in the right direction toward a very usable standalone Spark cluster.

Spark Standalone Bundle Core Components

The spark standalone bundle deploy consists of a few different components.

The core components:

| Component        | Charmstore                      |
|------------------|---------------------------------|
| spark            | cs:~omnivector/spark            |
| conda            | cs:~omnivector/conda            |
| jupyter-notebook | cs:~omnivector/jupyter-notebook |

The peripheral/monitoring components:

| Component   | Charmstore      |
|-------------|-----------------|
| telegraf    | cs:telegraf-27  |
| grafana     | cs:grafana-23   |
| prometheus2 | cs:prometheus2  |

Deploy the Bundle

To deploy this bundle, save the following YAML to a file (e.g. `bundle.yaml`) and deploy it with `juju deploy ./bundle.yaml`.


```yaml
series: bionic
applications:
  grafana:
    charm: cs:grafana-23
    num_units: 1
    expose: true
    constraints: root-disk=20480 instance-type=t3.large
  jupyter-notebook:
    charm: cs:~omnivector/jupyter-notebook
    num_units: 1
    expose: true
    constraints: root-disk=51200 instance-type=t3.xlarge
  prometheus2:
    charm: cs:prometheus2
    num_units: 1
    constraints: root-disk=51200 instance-type=t3.large
  pyspark:
    charm: cs:~omnivector/conda
    num_units: 0
    options:
      conda-extra-packages: pyspark=2.4.0 numpy ipykernel pandas pip
  spark:
    charm: cs:~omnivector/spark
    num_units: 10
    expose: true
    constraints: root-disk=51200 instance-type=t3.2xlarge
    storage:
      spark-work: ebs-ssd,100G
      spark-local: ebs-ssd,100G
  telegraf:
    charm: cs:telegraf-27
    num_units: 0
relations:
- - pyspark:juju-info
  - spark:juju-info
- - pyspark:juju-info
  - jupyter-notebook:juju-info
- - telegraf:juju-info
  - spark:juju-info
- - grafana:grafana-source
  - prometheus2:grafana-source
- - telegraf:prometheus-client
  - prometheus2:target
- - telegraf:juju-info
  - jupyter-notebook:juju-info
```


Once the deployment settles, `juju status` should look similar to:

```
Model       Controller  Cloud/Region   Version  SLA          Timestamp
spark-0000  pdl-aws     aws/us-west-2  2.5.4    unsupported  00:43:35Z

App               Version  Status  Scale  Charm             Store       Rev  OS      Notes
grafana                    active      1  grafana           jujucharms   23  ubuntu  exposed
jupyter-notebook           active      1  jupyter-notebook  jujucharms   32  ubuntu  exposed
prometheus2                active      1  prometheus2       jujucharms    8  ubuntu
pyspark                    active     11  conda             jujucharms   20  ubuntu
spark             2.4.1    active     10  spark             jujucharms   35  ubuntu  exposed
telegraf                   active     11  telegraf          jujucharms   27  ubuntu

Unit                 Workload  Agent  Machine  Public address  Ports                                          Message
grafana/0*           active    idle   28                       3000/tcp                                       Started grafana-server
jupyter-notebook/0*  active    idle   26                       8888/tcp
  pyspark/20         active    idle                                                                           Conda Env Installed: pyspark
  telegraf/20        active    idle                            9103/tcp                                       Monitoring jupyter-notebook/0
prometheus2/0*       active    idle   27                       9090/tcp,12321/tcp                             Ready
spark/1*             active    idle   1                        7077/tcp,7078/tcp,8080/tcp,8081/tcp,18080/tcp  Running: master,worker,history
  pyspark/0*         active    idle                                                                           Conda Env Installed: pyspark
  telegraf/13        active    idle                            9103/tcp                                       Monitoring spark/1
spark/13             active    idle   13                       7078/tcp,8081/tcp                              Services: worker
  pyspark/9          active    idle                                                                           Conda Env Installed: pyspark
  telegraf/10        active    idle                            9103/tcp                                       Monitoring spark/13
spark/15             active    idle   15                       7078/tcp,8081/tcp                              Services: worker
  pyspark/16         active    idle                                                                           Conda Env Installed: pyspark
  telegraf/9         active    idle                            9103/tcp                                       Monitoring spark/15
spark/16             active    idle   16                       7078/tcp,8081/tcp                              Services: worker
  pyspark/19         active    idle                                                                           Conda Env Installed: pyspark
  telegraf/12        active    idle                            9103/tcp                                       Monitoring spark/16
spark/17             active    idle   17                       7078/tcp,8081/tcp                              Services: worker
  pyspark/11         active    idle                                                                           Conda Env Installed: pyspark
  telegraf/0*        active    idle                            9103/tcp                                       Monitoring spark/17
spark/18             active    idle   18                       7078/tcp,8081/tcp                              Services: worker
  pyspark/15         active    idle                                                                           Conda Env Installed: pyspark
  telegraf/5         active    idle                            9103/tcp                                       Monitoring spark/18
spark/20             active    idle   20                       7078/tcp,8081/tcp                              Services: worker
  pyspark/12         active    idle                                                                           Conda Env Installed: pyspark
  telegraf/11        active    idle                            9103/tcp                                       Monitoring spark/20
spark/21             active    idle   21                       7078/tcp,8081/tcp                              Services: worker
  pyspark/14         active    idle                                                                           Conda Env Installed: pyspark
  telegraf/15        active    idle                            9103/tcp                                       Monitoring spark/21
spark/24             active    idle   24                       7078/tcp,8081/tcp                              Services: worker
  pyspark/18         active    idle                                                                           Conda Env Installed: pyspark
  telegraf/6         active    idle                            9103/tcp                                       Monitoring spark/24

Machine  State    DNS  Inst id              Series  AZ          Message
1        started       i-0e76f0515db3ef90e  bionic  us-west-2a  running
13       started       i-0831223f2314c426c  bionic  us-west-2c  running
15       started       i-0f17d1e2c9cee938b  bionic  us-west-2b  running
16       started       i-0bb906ad5ab4cc227  bionic  us-west-2b  running
17       started       i-0e2c2cd380e2b61da  bionic  us-west-2a  running
18       started       i-008d16354d8c3e7eb  bionic  us-west-2a  running
20       started       i-0abe5d7237ef8d70a  bionic  us-west-2a  running
21       started       i-08eaefd1632223fb0  bionic  us-west-2c  running
24       started       i-0b56741810fc00e32  bionic  us-west-2a  running
26       started       i-0fb7da77bfaf82c5e  bionic  us-west-2c  running
27       started       i-0e77e7c0eb2dd5d14  bionic  us-west-2c  running
28       started       i-0ffbbc8e3d741bb0c  bionic  us-west-2c  running
```


Log in to the Jupyter notebook at the IP address and port shown in the `juju status` output. Using the example above, the jupyter-notebook UI can be accessed at `http://<jupyter-notebook-public-address>:8888`.

Once you are logged into the jupyter-notebook ui, go ahead and fire up a new notebook in the environment created by the conda charm.

As you can see, the conda environment will have the same name as the deployed conda charm application.
In this case the conda charm application is named ‘pyspark’, so we should find a conda environment created with the name ‘pyspark’.
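A quick way to confirm which interpreter a notebook kernel is running on is to inspect `sys.executable` from a cell (the `/opt/conda/envs/pyspark` prefix shown in the comment matches the path used in the PYSPARK_PYTHON example later in this tutorial):

```python
import sys

# The interpreter backing this kernel; for a kernel created from the
# 'pyspark' conda environment it should live under /opt/conda/envs/pyspark/
print(sys.executable)
```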

Tell the notebook where the workers' Python lives by setting `PYSPARK_PYTHON` to the location of the conda environment's interpreter. In this example we would use:

```python
os.environ['PYSPARK_PYTHON'] = '/opt/conda/envs/pyspark/bin/python'
```


A simple example to verify everything is working (substitute in your own spark-master ip):

```python
import os
os.environ['PYSPARK_PYTHON'] = '/opt/conda/envs/pyspark/bin/python'

from pyspark.sql import SparkSession
from pyspark import SparkConf

import random

conf = SparkConf()\
    .setMaster('spark://<spark-master-ip>:7077')

spark = SparkSession.builder.config(conf=conf).getOrCreate()

sc = spark.sparkContext

num_samples = 100000000

def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
```
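The same Monte Carlo estimator can be sanity-checked locally without a cluster. This plain-Python version (an illustration, not part of the charm tooling) uses a seeded RNG and far fewer samples:

```python
import random

def estimate_pi(num_samples, seed=42):
    """Monte Carlo pi estimate: the fraction of random points in the unit
    square that land inside the quarter circle, times 4."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1:
            inside += 1
    return 4 * inside / num_samples

print(estimate_pi(100000))
```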


Test S3A Works

A simple cell to verify S3A is working correctly across the cluster. Once deployed, run the following cell in your notebook, substituting in your own AWS credentials, S3 endpoint, Spark master IP, and S3 file location.

```python
import os
os.environ['PYSPARK_PYTHON'] = '/opt/conda/envs/pyspark/bin/python'

from pyspark.sql import SparkSession
from pyspark import SparkConf

conf = SparkConf()\
    .setMaster('spark://<spark-master-ip>:7077')\
    .set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.1.2')

spark = SparkSession.builder.config(conf=conf).getOrCreate()

sc = spark.sparkContext
sc._jsc.hadoopConfiguration().set("fs.s3a.connection.ssl.enabled", "true")
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "<aws-s3-endpoint>")
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "<aws-access-key>")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "<aws-secret-key>")

# Read from your own S3 file location to verify S3A end to end.
df = spark.read.text("s3a://<bucket>/<file>")
df.count()
```
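The `hadoopConfiguration()` calls above can also be expressed as plain `SparkConf` entries using Spark's `spark.hadoop.` prefix, where any key under that prefix is copied into the Hadoop configuration when the session starts. This sketch only builds the key/value pairs, so it runs without a cluster; the placeholder values are the same ones you would substitute above:

```python
# S3A settings as they would appear in the Hadoop configuration.
s3a_settings = {
    "fs.s3a.connection.ssl.enabled": "true",
    "fs.s3a.endpoint": "<aws-s3-endpoint>",
    "fs.s3a.access.key": "<aws-access-key>",
    "fs.s3a.secret.key": "<aws-secret-key>",
}

# Prefix each key with 'spark.hadoop.' so SparkConf forwards it to Hadoop.
conf_pairs = [("spark.hadoop." + key, value) for key, value in s3a_settings.items()]

# These pairs could then be applied with SparkConf().setAll(conf_pairs).
print(conf_pairs)
```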



Log in to the Grafana monitoring dashboard by retrieving the IP address of the grafana unit from `juju status` and accessing the Grafana web UI at `http://<grafana-public-address>:3000`.

user: admin
password: retrieve it with

```
juju run-action grafana/0 get-admin-password --wait --format json | jq -r '.[]["results"]["password"]'
```
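If `jq` is not available, the same extraction can be done in Python. This sketch assumes the action output shape implied by the jq filter above (a top-level unit key containing a `results` map); the unit key and password value are illustrative only:

```python
import json

# Example output shaped like `juju run-action ... --format json`
# (the unit key and password here are placeholders, not real values).
raw = '{"unit-grafana-0": {"results": {"password": "example-password"}}}'

output = json.loads(raw)
# Equivalent of jq's '.[]["results"]["password"]': take the first (only)
# unit entry and read its results.password field.
password = next(iter(output.values()))["results"]["password"]
print(password)
```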



- Perhaps fewer than 20 units of spark?
- A few words or links about the aux charms?


Thanks! The number of units is arbitrary; the point of this is to show that it can be far greater than just a singleton! I do like where you are going with more realistic/sensible defaults for the common man; that will do. I changed it to 10 units in the markdown above. Good insight.