The following is an example of a Spark context initialized in a jupyter-notebook cell, where the jupyter-notebook runs as a container/juju charm on k8s and executes against the same k8s cluster it runs on, by passing the k8s cluster IP to `setMaster` and the driver host IP to `spark.driver.host`:
```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf() \
    .setAppName('JUJU_PI_TEST') \
    .setMaster('k8s://https://10.152.183.1:443') \
    .set('spark.kubernetes.container.image',
         'docker.io/omnivector/spark-2.4.1-hadoop-3.2.0:v1.0.0') \
    .set('spark.driver.host', '10.1.74.16') \
    .set('spark.driver.port', '41049')

spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext
```
In this configuration the jupyter-notebook (where the Spark code is executed) serves as the Spark driver host, and thus must pass its own host/port details into the Spark context configuration so that the executors know where to talk back to.
The configs I am focusing on automating are:

```python
.set('spark.driver.host', '10.1.74.16')
```

and

```python
.setMaster('k8s://https://10.152.183.1:443')
```

I currently get the value for `spark.driver.host` by manually running `ip a` in a notebook cell to get the container IP, and I look at `kubectl get services` to get the `CLUSTER-IP` to set in `setMaster`.
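To make the manual step I want to automate concrete, here is a small sketch that pulls the `CLUSTER-IP` of the `kubernetes` service out of `kubectl get services` output (the sample output below is illustrative, and in practice the text would come from `subprocess.check_output(['kubectl', 'get', 'services'])`):

```python
def cluster_ip(kubectl_output, service='kubernetes'):
    """Parse the CLUSTER-IP column for the named service."""
    lines = kubectl_output.strip().splitlines()
    header = lines[0].split()
    ip_col = header.index('CLUSTER-IP')
    for line in lines[1:]:
        fields = line.split()
        if fields[0] == service:
            return fields[ip_col]
    return None

# Illustrative output, not from my cluster:
sample = """\
NAME         TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.152.183.1   <none>        443/TCP   3d
"""

print(cluster_ip(sample))  # 10.152.183.1
```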
I want to discuss how I might track these values via juju and provide them through the notebook container's runtime environment, so that the user does not have to look them up and fill them in manually in the notebook cell.
The jupyter-k8s charm can be found here.
Setting the `$SPARK_DRIVER_BIND_ADDRESS` docker env var can be used in place of specifying `spark.driver.host` inline in the notebook cell/Spark context configuration; see the pyspark docker image entrypoint.sh. I’m thinking that if I can get the IP address of the container via charm code, then I could render the env var into the pod spec.
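Roughly what I have in mind on the charm side, as a sketch: the `get_container_ip` helper and the pod-spec shape here are hypothetical (whether hostname resolution is the right way to get a pod's IP from a k8s charm is exactly my question A below), and the real spec structure would follow whatever the charm framework expects:

```python
import socket

def get_container_ip():
    # Hypothetical helper: resolve the pod's own IP from its hostname.
    return socket.gethostbyname(socket.gethostname())

def render_pod_spec(container_ip):
    # Illustrative pod-spec fragment with the env var rendered in.
    return {
        'containers': [{
            'name': 'jupyter-notebook',
            'env': [
                {'name': 'SPARK_DRIVER_BIND_ADDRESS', 'value': container_ip},
            ],
        }],
    }

# In the charm this would be render_pod_spec(get_container_ip());
# shown here with the example IP from above:
spec = render_pod_spec('10.1.74.16')
```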
This leaves me with two solid questions.
From a charm’s perspective:
A) How do I get the IP address of the container the charm is written for (I'm guessing that `unit_get('private-address')` isn't yet adapted to work with kubernetes charms)?
B) How can I get the `CLUSTER-IP` as provided by `kubectl get services`, and/or run administrative operations (`kubectl` commands) from a charm?
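One thing that may partially answer B) without kubectl at all: kubernetes injects `KUBERNETES_SERVICE_HOST` and `KUBERNETES_SERVICE_PORT` env vars into every container, pointing at the API server's cluster IP, which is the same value `kubectl get services` shows for the `kubernetes` service. A sketch of assembling the master URL from those (the fallback values here are just the examples from above):

```python
import os

def k8s_master_url(host, port):
    # Build the spark master URL in the k8s://https://<ip>:<port> form.
    return 'k8s://https://{}:{}'.format(host, port)

master = k8s_master_url(
    os.environ.get('KUBERNETES_SERVICE_HOST', '10.152.183.1'),
    os.environ.get('KUBERNETES_SERVICE_PORT', '443'),
)
```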
I’m thinking about writing wrappers that would run as charm code on the juju operator pod and return what I want to know about each container/pod by parsing the output of `kubectl`. Seems a bit hack-y though … I don’t really know.
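For concreteness, the kind of wrapper I mean, assuming `kubectl` is available on the operator pod (the `run` parameter is injectable only so the sketch can be exercised without a cluster):

```python
import subprocess

def kubectl_cluster_ip(service='kubernetes', namespace='default',
                       run=subprocess.check_output):
    # Shell out to kubectl and ask for only the field we need via
    # jsonpath, instead of parsing the whole services table.
    out = run(['kubectl', 'get', 'service', service, '-n', namespace,
               '-o', 'jsonpath={.spec.clusterIP}'])
    return out.decode().strip()
```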
Looking for some feedback.
Thanks!