Kubeflow charms now available


#1

The kubeflow charms from the juju-solutions repo have been uploaded to the charm store under the ~juju namespace.

Overview

Kubeflow is a collection of loosely related components that provide easy ways of running machine learning code in various forms within Kubernetes. JupyterHub provides an interactive notebook interface with TensorFlow and other libraries pre-installed. TensorFlow Training and PyTorch Training allow you to train models, which can then be served with either TensorFlow Serving or Seldon. TensorFlow Dashboard provides a management interface for TensorFlow Training or Serving jobs.

The Juju Kubeflow Charms provide an alternative to Ksonnet for deploying Kubeflow.

Prerequisites

You will need any Kubernetes cluster, plus a Juju 2.5 controller (2.5 rc1 or better).

Two suggested configurations are: a LXD controller with microk8s, or Kubernetes deployed to a cloud with the integrator charm for that cloud.

Deploying Kubeflow

The easiest way to get things going is to deploy the kubeflow bundle.

juju deploy cs:kubeflow

The following is a list of all the available Kubeflow charms:

cs:~juju/kubeflow-tf-hub JupyterHub with Kubeflow libraries and settings
cs:~juju/kubeflow-tf-job-dashboard TensorFlow Dashboard
cs:~juju/kubeflow-tf-serving TensorFlow Serving
cs:~juju/kubeflow-seldon-cluster-manager Seldon Serving
cs:~juju/kubeflow-seldon-api-frontend Seldon Serving Frontend
cs:~juju/kubeflow-tf-job-operator TensorFlow Training
cs:~juju/kubeflow-pytorch-operator PyTorch Training
cs:~juju/kubeflow-ambassador Ambassador API Gateway

Some of the charms have additional options you may want to set, in particular the notebook-storage-size option for cs:~juju/kubeflow-tf-hub to attach persistent storage for the notebooks. These options can be set in the bundle or via the deploy CLI.

juju deploy cs:~juju/kubeflow-tf-hub \
    --config notebook-storage-size=10Mi

Note: TensorFlow Serving is a special case, in that you will deploy a separate instance of the charm for each model you wish to serve, and provide that model via either a resource or charm config as a URL.

Note: depending on the k8s cluster and the undercloud, you may also need to deploy the Hub with a LoadBalancer service in order to allow ingress. This is not necessary for CDK deployed on AWS with the integrator charm for example.

juju deploy cs:~juju/kubeflow-tf-hub \
    --config notebook-storage-size=10Mi \
    --config kubernetes-service-type=LoadBalancer

Using Kubeflow

Once deployed, go to the JupyterHub service endpoint, log in with any username and password, and click the Start My Server button. Run juju status to see the address of the JupyterHub application; this is the IP address to connect to. The port is 8000.
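As a rough sketch of that lookup, the following Python snippet pulls the application address out of `juju status --format=json` output and builds the Hub URL. It assumes the application is named `kubeflow-tf-hub` (the default when the charm is deployed without an explicit name) and that the JSON output exposes the address under `applications.<name>.address`. Both are assumptions, so adjust them to match what your controller reports.

```python
import json
import subprocess

HUB_PORT = 8000  # port stated in the instructions above


def hub_url(status, app="kubeflow-tf-hub"):
    """Build the JupyterHub URL from a parsed `juju status` document.

    The default `app` name and the `applications -> <app> -> address`
    layout are assumptions, not something the post confirms.
    """
    address = status.get("applications", {}).get(app, {}).get("address")
    if not address:
        raise LookupError(f"no address reported yet for application {app!r}")
    return f"http://{address}:{HUB_PORT}"


def current_hub_url(app="kubeflow-tf-hub"):
    """Run `juju status --format=json` and return the Hub URL."""
    out = subprocess.run(
        ["juju", "status", "--format=json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return hub_url(json.loads(out), app)


if __name__ == "__main__":
    # Hypothetical sample document; the address is made up for illustration.
    sample = {"applications": {"kubeflow-tf-hub": {"address": "10.152.183.10"}}}
    print(hub_url(sample))
```

Keeping the parsing in `hub_url` separate from the `juju` call makes it possible to try the logic against a saved status document without a live controller.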

If desired, use the form to choose the version, CPU / GPU, or memory resources. If the form is left blank, the latest version and reasonable defaults will be used. Then click Submit.

A new Jupyter Notebook pod will be created for your user (with persistent storage attached, if configured) and you will be taken to a file browser interface.

Select New -> Notebook from the top right to create a notebook.

You can then run some ML code. The example from the Kubeflow User Guide is:

from tensorflow.examples.tutorials.mnist import input_data
# Download (if needed) and load MNIST with one-hot encoded labels.
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

import tensorflow as tf

# Input: flattened 28x28 images.
x = tf.placeholder(tf.float32, [None, 784])

# Model parameters: one weight vector and bias per digit class.
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

# Softmax regression: predicted class probabilities.
y = tf.nn.softmax(tf.matmul(x, W) + b)

# True labels and the cross-entropy loss against the predictions.
y_ = tf.placeholder(tf.float32, [None, 10])
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))

train_step = tf.train.GradientDescentOptimizer(0.05).minimize(cross_entropy)

sess = tf.InteractiveSession()
tf.global_variables_initializer().run()

# Train for 1000 steps on mini-batches of 100 images.
for _ in range(1000):
  batch_xs, batch_ys = mnist.train.next_batch(100)
  sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

# Accuracy: fraction of test images whose most likely class matches the label.
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))

When run, this should result in something around 0.9014 being printed as the calculated accuracy.
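To make it clearer what that number means, here is a small pure-Python sketch of the same accuracy calculation that the last three lines of the example perform: take the most likely class for each prediction, compare it with the one-hot label, and average the hits. The toy predictions and labels are made up for illustration and have nothing to do with MNIST.

```python
def argmax(row):
    """Index of the largest value in a list (first one wins on ties)."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(predictions, labels):
    """Fraction of rows where the predicted class matches the one-hot label."""
    hits = [argmax(p) == argmax(l) for p, l in zip(predictions, labels)]
    return sum(hits) / len(hits)


# Hypothetical class probabilities for four samples and three classes.
preds = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1], [0.3, 0.3, 0.4], [0.2, 0.5, 0.3]]
# One-hot labels; the third sample is misclassified.
labels = [[0, 1, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0]]

print(accuracy(preds, labels))  # 3 of 4 correct -> 0.75
```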

TensorFlow Training

To submit models to be trained, you must create a TFJob custom resource in Kubernetes. For example, to submit the distributed mnist model, which is used for e2e testing, you first need to follow the instructions here to build the docker image locally:

https://github.com/kubeflow/tf-operator/tree/master/examples/v1alpha2/dist-mnist

Then:

kubectl create -n $namespace -f https://raw.githubusercontent.com/kubeflow/tf-operator/master/examples/v1alpha2/dist-mnist/tf_job_mnist.yaml

Note: The namespace is the name of the Kubernetes model in Juju that this charm is deployed into.

You can then check on the status of the job via either the TensorFlow Dashboard, or kubectl:

kubectl get -o yaml -n $namespace tfjobs dist-mnist-for-e2e-test
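If you would rather script the status check, something along these lines might work. It is only a sketch: it asks kubectl for JSON instead of YAML and assumes the operator records progress as a list under `status.conditions`, with `type` and `status` fields on each entry. That layout is an assumption, so compare it with the actual output on your cluster before relying on it.

```python
import json
import subprocess


def get_job_cmd(namespace, kind, name):
    """Return the kubectl argv that fetches one job as JSON."""
    return ["kubectl", "get", "-o", "json", "-n", namespace, kind, name]


def job_state(job):
    """Return the type of the last condition whose status is "True".

    Assumes progress is recorded under `status.conditions`; returns
    "Unknown" when no such condition is present.
    """
    conditions = job.get("status", {}).get("conditions") or []
    active = [c.get("type") for c in conditions if c.get("status") == "True"]
    return active[-1] if active else "Unknown"


def fetch_job_state(namespace, kind, name):
    """Run kubectl and summarise the job's state."""
    out = subprocess.run(
        get_job_cmd(namespace, kind, name),
        check=True, capture_output=True, text=True,
    ).stdout
    return job_state(json.loads(out))
```

For the job above this would be called as `fetch_job_state(namespace, "tfjobs", "dist-mnist-for-e2e-test")`, where `namespace` is the name of the Juju model.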

PyTorch Training

To submit models to be trained, you must create a PyTorchJob custom resource in Kubernetes. For example, to submit the distributed mnist model, which is used for e2e testing, you can use:

kubectl create -n $namespace -f https://raw.githubusercontent.com/kubeflow/pytorch-operator/master/examples/mpi-dist/mnist/cpu/v1beta1/mpi_mnist_job_cpu.yaml

More details about setting up pytorch jobs are found here:

https://github.com/kubeflow/pytorch-operator

Note: The namespace is the name of the Kubernetes model in Juju that this charm is deployed into.

You can then check the status of the job via either the TensorFlow Dashboard, or kubectl:

kubectl get -o yaml -n $namespace pytorchjobs dist-mnist-for-e2e-test

TensorFlow Serving

A separate instance of this charm should be deployed for each model to serve, with the model being provided as either a URL in charm config, or via a resource. For example, to serve the inception model, you would deploy it as:

juju deploy cs:~juju/kubeflow-tf-serving inception \
    --config model=gs://kubeflow-models/inception

You would then point the inception_client to port 9000 on the LB address.
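Since each model needs its own instance of the charm, serving several models means repeating the deploy command with a different application name and model URL each time. A minimal Python sketch of generating those commands is below. The `mnist` entry and its bucket URL are hypothetical placeholders; only the inception model comes from the example above.

```python
import shlex

CHARM = "cs:~juju/kubeflow-tf-serving"


def serving_deploy_cmd(app_name, model_url):
    """Return the `juju deploy` argv for one TensorFlow Serving instance."""
    return ["juju", "deploy", CHARM, app_name, "--config", f"model={model_url}"]


# One application per model to serve.
models = {
    "inception": "gs://kubeflow-models/inception",
    "mnist": "gs://example-bucket/mnist",  # hypothetical model location
}

for name, url in models.items():
    # Print a shell-safe command line; pass the list to subprocess.run
    # instead if you want to execute it directly.
    print(shlex.join(serving_deploy_cmd(name, url)))
```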

Seldon Serving

This charm must be deployed to a Kubernetes model in Juju and related to redis:

juju deploy cs:~juju/kubeflow-seldon-cluster-manager
juju deploy cs:~juju/redis-k8s
juju add-relation kubeflow-seldon-cluster-manager redis-k8s

To submit models to be trained or served, you must create a SeldonDeployment custom resource. Currently, the custom resource definition for this must be loaded manually via:

kubectl create -n $juju_model_name -f https://raw.githubusercontent.com/juju-solutions/charm-kubeflow-seldon-cluster-manager/start/files/crd-v1alpha1.yaml

The specific SeldonDeployment that you create will depend on how and what image you are wanting to serve, but a simple example might look like:

apiVersion: machinelearning.seldon.io/v1alpha1
kind: SeldonDeployment
metadata:
  labels:
    app: seldon
  name: mymodel
  namespace: default
spec:
  annotations:
    deployment_version: v1
    project_name: mymodel
  name: mymodel
  predictors:
  - annotations:
      predictor_version: v1
    componentSpec:
      spec:
        containers:
        - image: seldonio/mock_classifier:1.0
          imagePullPolicy: Always
          name: mymodel
          volumeMounts: []
        terminationGracePeriodSeconds: 1
        volumes: []
    graph:
      children: []
      endpoint:
        type: REST
      name: mymodel
      type: MODEL
    name: mymodel
    replicas: 1
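If you need to create several of these, the manifest can also be generated rather than written by hand. The sketch below builds the same structure as the YAML example in Python and prints it as JSON. The function name and its parameters are made up for illustration; the field values simply mirror the example above.

```python
import json


def seldon_deployment(name, image, namespace="default", replicas=1):
    """Return a SeldonDeployment manifest mirroring the YAML example above."""
    container = {
        "image": image,
        "imagePullPolicy": "Always",
        "name": name,
        "volumeMounts": [],
    }
    predictor = {
        "annotations": {"predictor_version": "v1"},
        "componentSpec": {
            "spec": {
                "containers": [container],
                "terminationGracePeriodSeconds": 1,
                "volumes": [],
            }
        },
        "graph": {
            "children": [],
            "endpoint": {"type": "REST"},
            "name": name,
            "type": "MODEL",
        },
        "name": name,
        "replicas": replicas,
    }
    return {
        "apiVersion": "machinelearning.seldon.io/v1alpha1",
        "kind": "SeldonDeployment",
        "metadata": {"labels": {"app": "seldon"}, "name": name, "namespace": namespace},
        "spec": {
            "annotations": {"deployment_version": "v1", "project_name": name},
            "name": name,
            "predictors": [predictor],
        },
    }


if __name__ == "__main__":
    manifest = seldon_deployment("mymodel", "seldonio/mock_classifier:1.0")
    print(json.dumps(manifest, indent=2))
```

kubectl accepts JSON manifests as well as YAML, so the printed output can be saved to a file and passed to `kubectl create -f`.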

#2

This is a really great overview. Thanks for putting so much work in @wallyworld.

We should look into expanding Carmine’s kaggle tutorial to make use of these charms. One of the questions in his webinar was “so, does this work on AWS?” and his answer was more or less “No, sorry” because he had written bash scripts to deploy onto GCE. It would be awesome to say, “You bet! You can run this anywhere!”


#3

It was @cory_fu who did the hard work - I just transcribed his Google doc and uploaded the charms to the production store once the store gained the ability to host k8s charms.


#4

Ah right. It’s sort of a shame that Discourse doesn’t allow users to nominate someone else as the author.


#5

I will work on expanding the kaggle tutorial and add this in.


#6

I’ve just successfully created the kubeflow cluster, but the moment I tried to spawn the new jhub server - this is what I’m getting:

500 : Internal Server Error
Redirect loop detected. Notebook has jupyterhub version 0.9.1, but the Hub expects 0.9.0.dev. Try installing jupyterhub==0.9.0.dev in the user environment if you continue to have problems.
You can try restarting your server from the home page.

Any ideas @wallyworld?


#7

I haven’t seen that issue. I just retested the initial ML Python example with success. The steps I used:

microk8s.reset
juju bootstrap lxd
microk8s.config | juju add-k8s k8stest
juju add-model test k8stest
juju create-storage-pool operator-storage kubernetes storage-class=microk8s-hostpath
juju deploy cs:~juju/kubeflow

I then went to the tf-hub web page, logged in, and created a notebook as per the instructions.

I did notice that some of the subsequent github file references for the pytorch training (and others) had become out of date since the instructions were first written months ago. I’ve attempted to update to fix the links but haven’t fully tested everything yet.


#8

Not sure what the root cause is, but the code for sure needs more flexibility.
Currently the jupyterhub image is pulled from a custom repository. If I try to pull a generic image of any version from gcr.io, the container doesn’t start, complaining about missing libraries imported by the config script.
If I try to specify a notebook image other than the default, it doesn’t work either. We definitely need Cory to take a look at this.


#9

Finally got it all running (I’m trying to do all of this on bare metal).
First, I had to dynamically update jupyterhub to 0.9.1 when running the pod (originally it was running 0.9.0.dev).
Next, within the image there is a tensorflow pip package installed at version 1.8.0. It is required for running the example above. Based on https://github.com/tensorflow/tensorflow/issues/17411, there is no way to run this job without CUDA cores if the version of tensorflow is above 1.5. Downgrading tensorflow to 1.5 inside the container solves the problem and allows the test job to run.


#10

The Kubeflow charms at this stage are very much demoware rather than production ready.
The source can be found here: https://github.com/juju-solutions?q=kubeflow

As time permits, they will be improved. Patches gratefully accepted.