How do we make it easier to educate users on how to troubleshoot issues they encounter and participate in fixing them?


#1

One thing that we need to work on as a product is figuring out how to make it easy for users to fix problems they encounter. At the moment, something breaks and they immediately think “ffs juju”. Or perhaps “juju 🤦”, as used in the extract below.

Currently, Juju just throws its hands up and says something like “hook failed” and leaves the user with 0 options. We kill any sense of control and make users feel frustrated and stuck.

It’s a multi-faceted problem. Users think that Juju is misbehaving, but from Juju’s point of view, charms are misbehaving. But charm authors usually depend on upstream layers, so to them it feels like other charms are misbehaving. And so it’s way too difficult for users to identify the actual issue and report a bug to the right place, let alone fix problems they encounter in production.

Here’s a snippet from one of Canonical’s internal IRC channels. I think it’s representative.

[11:33:47]<tartley> my first error from 'juju status' in staging, pre-deploy:
[11:33:50]<tartley> snapdevicegw-r050f610/1         error     idle   81       10.50.78.177                                                                               hook failed: "store-services-relation-changed"
[11:33:58]<tartley> No idea what I should do about this. Any clues?
[11:34:19]<roadmr> hm
[11:35:00]<tartley> doesn't go away when I 'juju status' again. :-)
[11:35:29]<roadmr> tartley: it has to do with verterok's change for the port thingy
[11:35:30]<roadmr> memcache_port = memcache.get_remote_all('port')[0]
[11:35:38]<roadmr> IndexError: list index out of range
[11:35:56]<verterok> tartley, roadmr back
[11:35:58]<verterok> looking
[11:36:25]<roadmr> thanks, I was about to go back to the MP to figure this out
[11:38:10]<verterok> the weird part, only one unit failed
[11:38:14]<verterok> thanks juju
[11:38:42]<verterok> tartley: taking a closer look, will free the env for you in a bit 
[11:40:03]<roadmr> juju 🤦  
[11:40:22]<tartley> verterok, okdokes, thanks for the assist, no hurry on my part.
[11:40:34]<verterok> will just destroy it and land some defensive code...
[11:40:58]<tartley> in fact, I'll step away a minute then, get kiddo dinner started, back soon...
[11:41:12]* tartley is now known as tartley|BRB
[11:41:20]<timclicks> roadmr: which charm/layer is this? is it a public memcache layer?
[11:42:47]<roadmr> timclicks: it's our (snapdevicegw) charm trying to get the port for those memcache units
[11:44:37]<verterok> timclicks: hi
[11:44:50]<verterok> timclicks: is using memcached layer
[11:44:59]<verterok> but is a pinned version
[11:45:01]* verterok checks
[11:45:24]<verterok> timclicks: the weird part is that 1/2 units got the port just fine
[11:45:47]<timclicks> :/
[11:46:01]<verterok> the charm is using memcache.get_remote_all('port')
[11:46:15]<verterok> as memcache_hosts() only return the IPs and not the ports
[11:46:56]<verterok> we are using memcache interface 5ccddd9 
[11:49:05]<timclicks> I wonder if the call could (optionally?) block until a port is available
[11:49:40]<verterok> actually b3245f167e9cfbe10de1f88553420347c588eec7
[11:49:48]<timclicks> if one of the units gets the right info, then it's likely the 2nd one will eventually right itself
[11:49:58]<timclicks> once the data propagates through the system
[11:50:04]<verterok> it could...but it's relying in the available state
[11:50:16]<verterok> :) 
[11:50:25]<timclicks> going back to tartley|BRB's question, perhaps waiting for a minute, then "juju resolve --all" 
[11:51:00]<verterok> tried that, still failed
[11:51:04]<timclicks> sign
[11:51:09]<timclicks> *sigh
[11:51:36]<verterok> the code is: https://git.launchpad.net/snapdevicegw/tree/charm/reactive/snapdevicegw.py#n99
[11:52:06]<timclicks> this is one of those cases where the juju team abstains responsibility because it's a charming bug, but the charm writers don't know what to do because juju makes state management too hard
[11:52:18]<verterok> tartley|BRB: staging is clear now
[11:52:49]<timclicks> meanwhile users just see a borked system 
[11:54:29]<timclicks> this definitely feels like a bug in the memcache layer. If memcache.available is True, then there should be a remote port to access
[11:54:55]<verterok> indeed
[11:55:08]<verterok> the interface is buggy
[11:55:10]<timclicks> that's outside of my area of expertise
[11:55:28]<verterok> https://github.com/omnivector-solutions/interface-memcache
[11:55:36]<verterok> I can probably fix it
[11:55:55]<verterok> instead of workaround it in our charm
[11:56:00]<timclicks> james beedy (bdx) on freenode, is a very active member of the community
[11:56:09]<timclicks> and he would accept patches very quickly
[11:56:19]<verterok> will send him a PR
[11:56:32]<verterok> thx for the pointer
[11:57:13]<timclicks> hopefully we can reduce the level of juju-facepalm in the future!
[11:57:20]<timclicks> (please don't make that an actual juju plugin)
[11:57:56]<verterok> roadmr: 2 ^
[11:58:01]<verterok> s/2//
[11:58:13]<verterok> your next plugin?
[11:58:47]* verterok EODs
[11:58:49]<roadmr>  I don't recall if emoju has a 🤦 command, checking
[11:59:16]<roadmr> no, no facepalm, i'll add it :) 
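
The “defensive code” verterok mentions could look roughly like the sketch below. It is only a sketch: `FakeMemcache` is a stand-in for the real interface object, and `memcache_port_or_none` is a hypothetical helper name, not something from the charm or the interface layer. The idea is to treat an empty result from `get_remote_all('port')` as “the data has not arrived yet” instead of indexing into it.

```python
class FakeMemcache:
    """Stand-in for the memcache interface object (hypothetical, for illustration)."""

    def __init__(self, remote_data):
        self._remote_data = remote_data

    def get_remote_all(self, key):
        # Assumed to behave as in the log: returns a list, possibly empty
        return list(self._remote_data.get(key, []))


def memcache_port_or_none(memcache):
    """Return the first published port, or None if nothing has arrived yet."""
    ports = memcache.get_remote_all('port')
    if not ports:
        # Relation data has not propagated yet; let the caller wait and retry
        return None
    return ports[0]


ready = FakeMemcache({'port': ['11211']})
not_ready = FakeMemcache({})
print(memcache_port_or_none(ready))
print(memcache_port_or_none(not_ready))
```

A hook that gets `None` back can report that it is waiting and return cleanly, so the unit is not left in “hook failed” while the data propagates.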

#2

Yeah, figuring out problems with juju is close to debugging assembler. Insanely powerful, steep learning curve.

For one, perhaps make the debug-log of juju more similar to a “-v” “-vv” “-vvv” … “-vvvvvvvvv” like experience?

  • A “juju debug-error” plugin perhaps … or “juju debug-log -vvvv” ?
  • An easier filter usage, like: "juju debug-log " (instead of: juju debug-log --include ) ?
  • Have juju indicate where in the “state machine” something goes wrong (for example, clearly indicate in the log output which state juju is in). Could perhaps be part of the “juju status” output?
  • (Preventive measures) Have the charmstore be a bit more picky about some charms. For example, automatically rate charms on a weighted score of: age, number of supported series, number of supported OSes, “low number of warnings from charm proof”, “existence of a good README.md”, “existence of a getstarted.md”, “existence of a non-default icon”, “existence of a license/copyright”, “existence of a repo and bug site”, etc. Anything that helps the community produce high-quality charms by gamifying the charming process in the direction of fame+glory+quality…
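
To make the weighted-score idea in the last bullet concrete, here is a minimal sketch. The criteria names follow the list above, but the weights, the 5-point penalty per warning and the function name `charm_score` are all made up for illustration; nothing like this exists in the charmstore as far as this thread shows.

```python
# Hypothetical weights: the criteria come from the suggestion above,
# the numbers are invented for the example.
WEIGHTS = {
    "has_readme": 3,
    "has_getstarted": 2,
    "has_custom_icon": 1,
    "has_license": 2,
    "has_repo_and_bug_site": 2,
}


def charm_score(facts, proof_warnings=0):
    """Score a charm from 0 to 100 from boolean facts and a warning count."""
    earned = sum(weight for name, weight in WEIGHTS.items() if facts.get(name))
    base = 100 * earned / sum(WEIGHTS.values())
    # Each warning costs 5 points; the score never drops below zero
    return max(0, round(base - 5 * proof_warnings))


complete = {name: True for name in WEIGHTS}
sparse = {"has_readme": True, "has_license": True}
print(charm_score(complete))
print(charm_score(sparse, proof_warnings=2))
```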

#3

Juju is interesting because there’s a large number of moving parts working together. There’s some cool stuff we could do to help make clear things like

“The charm says …”
“The Juju client says …”
“The underlying provider says …”
… and so on

Often it confuses folks, and I think that’s actually a hint that status could use a rethink to help clarify the status of “who”. A status line can mix the status of the unit agent, the charm messaging, and more.
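
As a rough illustration of separating the “who”, status could be modelled as one entry per speaker rather than one mixed line. This is a sketch under my own assumptions: `StatusLine` and `render` are hypothetical names, and the output layout is not an existing Juju format.

```python
from dataclasses import dataclass


@dataclass
class StatusLine:
    source: str        # who is speaking, e.g. "charm", "unit agent", "provider"
    state: str         # short state reported by that source
    message: str = ""  # optional free-text detail from that source


def render(unit, lines):
    """Render one line per source so it is clear who reported what."""
    out = [unit]
    for line in lines:
        text = f"  {line.source} says: {line.state}"
        if line.message:
            text += f" ({line.message})"
        out.append(text)
    return "\n".join(out)


print(render(
    "snapdevicegw-r050f610/1",
    [
        StatusLine("unit agent", "idle"),
        StatusLine("charm", "error", 'hook failed: "store-services-relation-changed"'),
    ],
))
```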


#4

I’d rather have a human readable output format for juju, that could do better at explaining errors. At the moment a lot of the errors we spew out are a mix between human and machine readable. I don’t think it’s the best of both worlds.

This is a good source of inspiration (compile errors for humans), I’d even go further than “working with compiler error messages sucks” to “working with any error messages sucks”.

If we can provide better guidance, then I’m sure half of the issues users are hitting would dissolve. When someone thinks “ok, I’ve hit a problem, now what?”, the error messages could give better pointers.


#5

I thought it might be useful to give some insight into where I believe we could improve, with a small showcase. Previously, typing a wrong command resulted in a confusing and annoying message. I updated this to be more helpful:

juju controller 
ERROR juju: "controller" is not a juju command. See "juju --help".

Did you mean:
	controllers

As you can see it helps the user by giving a way forward rather than a dead end.

A typical example of where we do provide dead ends is the following:

juju models test
ERROR unrecognized args: ["test"]

Essentially I missed -c to identify the controller. I think we could be a lot better here:

juju models test
ERROR juju: "test" is not a valid argument. See "juju models --help".

Did you mean:
        models -c test

Although a very contrived example, it does prove a point. If we want users (operators) to love juju we have to help them when learning or making mistakes. That way juju becomes less a ball of cryptic commands and becomes more human.
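
For anyone curious how a “Did you mean” hint can be produced, here is a minimal sketch using Python’s standard `difflib`. It is not how Juju implements it (Juju is written in Go); the `COMMANDS` list is a small hand-picked sample and `unknown_command_message` is a hypothetical name.

```python
import difflib

# A small sample of command names, for illustration only
COMMANDS = ["controllers", "models", "status", "switch", "add-model", "bootstrap"]


def unknown_command_message(name, commands=COMMANDS):
    """Build an error message that suggests close matches, if there are any."""
    lines = [f'ERROR juju: "{name}" is not a juju command. See "juju --help".']
    # get_close_matches ranks candidates by similarity and drops weak ones
    matches = difflib.get_close_matches(name, commands, n=3, cutoff=0.6)
    if matches:
        lines.append("")
        lines.append("Did you mean:")
        lines.extend(f"\t{match}" for match in matches)
    return "\n".join(lines)


print(unknown_command_message("controller"))
```

When nothing is close enough, the function falls back to the plain error, so the hint never suggests something unrelated.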


#6

One thing that I really love about Rust is that every error is coded. Those codes refer to a doc that explains situations where the error is encountered and often provides a suggestion or two about how to fix things.

I don’t know how practical it would be to retrofit some of that into Juju… but it might work well in the charms/hooks space.
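
A minimal sketch of what coded errors could look like, assuming a simple lookup table. The code `J0001`, the explanation text, and the names `CodedError` and `explain` are all hypothetical; this only illustrates the shape of the idea, not a proposal for actual codes.

```python
# Hypothetical registry: one short explanation per error code
ERROR_DOCS = {
    "J0001": (
        "A charm hook exited with an error. The failure is raised by charm "
        "code running on the unit; the unit log should contain the details."
    ),
}


class CodedError(Exception):
    """An error that carries a stable code users can look up."""

    def __init__(self, code, message):
        super().__init__(f"ERROR [{code}] {message}")
        self.code = code


def explain(code):
    """Return the longer explanation for a code, if one is recorded."""
    return ERROR_DOCS.get(code, f"No explanation recorded for {code}.")


err = CodedError("J0001", 'hook failed: "store-services-relation-changed"')
print(err)
print(explain(err.code))
```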


#7

My fav example of that is this sequence that was (is?) triggered before bootstrapping anywhere:

$ juju status
ERROR No selected controller.

Please use "juju switch" to select a controller.

$ juju switch
ERROR no currently specified model
$ juju add-model
ERROR model name is required
$ juju add-model testing
ERROR No selected controller.

Please use "juju switch" to select a controller.

#8

That feels like a bug to be filed, where switch would also detect the “no selected controller” situation and give you an error showing you a list of controllers. Good example.


#9

I’ve made a start by cleaning up what happens when we don’t supply a model name. Spending some dedicated time on this might be worthwhile in the future.

https://github.com/juju/juju/pull/10885


#10

This just landed in develop :smiley: