CDK Heapster Crashlooping


I’ve deployed CDK as a bundle to a MAAS cloud. The deployment went well, but after grabbing the kubeconfig and starting to interact with the cluster, I noticed all is not well. I initially had trouble accessing the dashboard, and then found that several core services are crashlooping. (Edited with the actual issue in post #2.)
I feel like I’m missing something trivial here, but the mire of K8s hoops I’ve jumped through to reach this point is making it hard to trace back to the core problem. Can someone please offer any pointers?


EDIT: so the heapster deployment is crashlooping:

```
I1226 10:41:15.694885       1 heapster.go:79] /heapster --source=kubernetes.summary_api:https://kubernetes.default?kubeletPort=10250&kubeletHttps=true --sink=influxdb:http://monitoring-influxdb:8086
I1226 10:41:15.694985       1 heapster.go:80] Heapster version v1.6.0-beta.1
I1226 10:41:15.695452       1 configs.go:61] Using Kubernetes client with master "https://kubernetes.default" and version v1
I1226 10:41:15.695488       1 configs.go:62] Using kubelet port 10250
I1226 10:41:15.796324       1 influxdb.go:312] created influxdb sink with options: host:monitoring-influxdb:8086 user:root db:k8s
I1226 10:41:15.796380       1 heapster.go:203] Starting with InfluxDB Sink
I1226 10:41:15.796396       1 heapster.go:203] Starting with Metric Sink
I1226 10:41:16.298315       1 heapster.go:113] Starting heapster on port 8082
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1724540]

goroutine 108 [running]:
k8s.io/heapster/metrics/sources/summary.(*summaryMetricsSource).decodeEphemeralStorageStatsForContainer(0xc4202e2f00, 0xc420769490, 0xc42069d4f0, 0xc42069d540)
	/go/src/ +0x90
k8s.io/heapster/metrics/sources/summary.(*summaryMetricsSource).decodeContainerStats(0xc4202e2f00, 0xc420b52930, 0xc420af9a68, 0x0, 0x298ce20)
	/go/src/ +0x537
k8s.io/heapster/metrics/sources/summary.(*summaryMetricsSource).decodePodStats(0xc4202e2f00, 0xc420b526c0, 0xc420b526f0, 0xc420af9bd8)
	/go/src/ +0x91e
k8s.io/heapster/metrics/sources/summary.(*summaryMetricsSource).decodeSummary(0xc4202e2f00, 0xc4201d3500, 0x0)
	/go/src/ +0x351
k8s.io/heapster/metrics/sources/summary.(*summaryMetricsSource).ScrapeMetrics(0xc4202e2f00, 0xed3b551bc, 0xc400000000, 0x296aa60, 0xed3b551f8, 0xc400000000, 0x296aa60, 0x296c008, 0x0, 0xc42000e913)
	/go/src/ +0x116
…(0xc4202e2f00, 0xed3b551bc, 0xc400000000, 0x296aa60, 0xed3b551f8, 0x0, 0x296aa60, 0x0, 0x0, ...)
	/go/src/ +0x11f
k8s.io/heapster/metrics/sources.(*sourceManager).ScrapeMetrics.func1(0xc420272990, 0x27bdb00, 0xc4202e2f00, 0xc420268960, 0xed3b551bc, 0x0, 0x296aa60, 0xed3b551f8, 0x0, 0x296aa60, ...)
	/go/src/ +0x155
created by k8s.io/heapster/metrics/sources.(*sourceManager).ScrapeMetrics
	/go/src/ +0x387
```

I don’t see any log entries that hint at what’s up :confused:
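For what it’s worth, the panic in `decodeEphemeralStorageStatsForContainer` looks like a classic unguarded pointer dereference: the kubelet’s summary API can omit ephemeral-storage stats, and dereferencing the missing field segfaults. A minimal Go sketch of the pattern (these types are illustrative stand-ins, not the real Heapster structs):

```go
package main

import "fmt"

// FsStats is a simplified stand-in for the kubelet summary-API
// filesystem stats (the real heapster types are more involved).
type FsStats struct {
	UsedBytes uint64
}

// ContainerStats models a container's stats, where the ephemeral-storage
// fields are optional and may be nil if the kubelet omits them.
type ContainerStats struct {
	Rootfs *FsStats
	Logs   *FsStats
}

// decodeUnsafe mimics the crashing pattern: it dereferences the optional
// fields without a nil check and panics when either is nil.
func decodeUnsafe(c ContainerStats) uint64 {
	return c.Rootfs.UsedBytes + c.Logs.UsedBytes
}

// decodeSafe is the guarded version a fixed decoder would need: it skips
// containers whose stats are incomplete instead of panicking.
func decodeSafe(c ContainerStats) (uint64, bool) {
	if c.Rootfs == nil || c.Logs == nil {
		return 0, false
	}
	return c.Rootfs.UsedBytes + c.Logs.UsedBytes, true
}

func main() {
	// A container with missing rootfs stats, as a kubelet might report.
	c := ContainerStats{Rootfs: nil, Logs: &FsStats{UsedBytes: 42}}
	if _, ok := decodeSafe(c); !ok {
		fmt.Println("skipped container with incomplete ephemeral-storage stats")
	}
}
```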

The Juju output looks like this:

```
seffyroff@Ame:~$ juju status
Model  Controller  Cloud/Region  Version  SLA          Timestamp
base   maas        maas          2.5-rc1  unsupported  02:58:53-08:00

App                    Version       Status  Scale  Charm                  Store       Rev  OS      Notes
ceph-fs                13.2.1+dfsg1  active      1  ceph-fs                jujucharms   36  ubuntu  
ceph-mon               13.2.1+dfsg1  active      3  ceph-mon               jujucharms  354  ubuntu  
ceph-osd               13.2.1+dfsg1  active      4  ceph-osd               jujucharms  380  ubuntu  
easyrsa                3.0.1         active      1  easyrsa                jujucharms  199  ubuntu  
etcd                   3.2.10        active      3  etcd                   jujucharms  352  ubuntu  
flannel                0.10.0        active      5  flannel                jujucharms  360  ubuntu  
kubeapi-load-balancer  1.14.0        active      1  kubeapi-load-balancer  jujucharms  538  ubuntu  exposed
kubernetes-master      1.13.1        active      2  kubernetes-master      jujucharms  542  ubuntu  
kubernetes-worker      1.13.1        active      3  kubernetes-worker      jujucharms  414  ubuntu  exposed

Unit                      Workload  Agent  Machine  Public address  Ports           Message
ceph-fs/0*                active    idle   2/lxd/1                                  Unit is ready (1 MDS)
ceph-mon/0                active    idle   0/lxd/0                                  Unit is ready and clustered
ceph-mon/1*               active    idle   1/lxd/0                                  Unit is ready and clustered
ceph-mon/2                active    idle   2/lxd/0                                  Unit is ready and clustered
ceph-osd/0                active    idle   2                                        Unit is ready (1 OSD)
ceph-osd/1*               active    idle   1                                        Unit is ready (1 OSD)
ceph-osd/2                active    idle   0                                        Unit is ready (1 OSD)
ceph-osd/3                active    idle   3                                        Unit is ready (1 OSD)
easyrsa/0*                active    idle   0/lxd/1                                  Certificate Authority connected.
etcd/0                    active    idle   0/lxd/2                  2379/tcp        Healthy with 3 known peers
etcd/1                    active    idle   1/lxd/1                  2379/tcp        Healthy with 3 known peers
etcd/2*                   active    idle   2/lxd/2                  2379/tcp        Healthy with 3 known peers
kubeapi-load-balancer/0*  active    idle   3/lxd/0                  443/tcp         Loadbalancer ready.
kubernetes-master/0*      active    idle   2                        6443/tcp        Kubernetes master running.
  flannel/1               active    idle                                            Flannel subnet
kubernetes-master/1       active    idle   0                        6443/tcp        Kubernetes master running.
  flannel/4               active    idle                                            Flannel subnet
kubernetes-worker/0       active    idle   1                        80/tcp,443/tcp  Kubernetes worker running.
  flannel/0*              active    idle                                            Flannel subnet
kubernetes-worker/1*      active    idle   0                        80/tcp,443/tcp  Kubernetes worker running.
  flannel/3               active    idle                                            Flannel subnet
kubernetes-worker/2       active    idle   2                        80/tcp,443/tcp  Kubernetes worker running.
  flannel/2               active    idle                                            Flannel subnet

Machine  State    DNS  Inst id              Series  AZ       Message
0        started       m3yhap               bionic  default  Deployed
0/lxd/0  started       juju-c0ce59-0-lxd-0  bionic  default  Container started
0/lxd/1  started       juju-c0ce59-0-lxd-1  bionic  default  Container started
0/lxd/2  started       juju-c0ce59-0-lxd-2  bionic  default  Container started
1        started       ywywnt               bionic  default  Deployed
1/lxd/0  started       juju-c0ce59-1-lxd-0  bionic  default  Container started
1/lxd/1  started       juju-c0ce59-1-lxd-1  bionic  default  Container started
2        started       n4wbbg               bionic  default  Deployed
2/lxd/0  started       juju-c0ce59-2-lxd-0  bionic  default  Container started
2/lxd/1  started       juju-c0ce59-2-lxd-1  bionic  default  Container started
2/lxd/2  started       juju-c0ce59-2-lxd-2  bionic  default  Container started
3        started       g43qk3               bionic  default  Deployed
3/lxd/0  started       juju-c0ce59-3-lxd-0  bionic  default  Container started
```


Taking all other variables out of the equation, I deployed kubernetes-core from the charm bundle directly to a MAAS controller, using default placement and configs, and Heapster still segfaults on deployment. I imagine something has changed somewhere upstream in the stack to cause this, but I welcome being proven wrong. Reading the upstream docs, it appears that Heapster itself is deprecated and no longer supported going forward?


Hmm, wonder if @tvansteenburgh or @kos.tsakalozos have any insight into this.


Hi @seffyroff. You’re right, Heapster is no longer supported upstream, having been replaced by metrics-server (which we do ship in CDK). But the default Kubernetes dashboard still relies on Heapster. The dashboard will switch to metrics-server eventually (hopefully soon), but until then, if you want the dashboard, you need Heapster.

All that to say: if you don’t need or want the dashboard, you can run `juju config kubernetes-master enable-dashboard-addons=false`, and that will get rid of the dashboard and heapster pods. If you want to keep the dashboard, you could try changing the heapster pod to use the older 1.5.4 container image; from looking at the source, it doesn’t appear to have the bug you’re hitting.
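For reference, the two options sketched as commands. The exact heapster deployment and container names vary by release, so the ones below are assumptions; check what’s actually running in your cluster before applying the second option:

```shell
# Option 1: drop the dashboard (and with it the heapster pods) entirely.
juju config kubernetes-master enable-dashboard-addons=false

# Option 2: keep the dashboard, but pin heapster to the older v1.5.4 image.
# First find the real deployment/container names (assumed below):
kubectl -n kube-system get deployments

# Then point the (assumed) heapster container at the v1.5.4 image:
kubectl -n kube-system set image deployment/heapster-v1.6.0-beta.1 \
  heapster=k8s.gcr.io/heapster-amd64:v1.5.4
```

Note that CDK manages the addon manifests, so a manual image change may be reverted the next time the addons are reconciled; treat option 2 as a test rather than a permanent fix.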