How do we make it easier to educate users on how to troubleshoot issues they encounter and participate in fixing them?

timClicks · 5 November 2019 20:33

One thing that we need to work on as a product is figuring out how to make it easy for users to fix problems they encounter. At the moment, something breaks and they immediately think “ffs juju”. Or perhaps “juju 🤦”, as used in the extract below.

Currently, Juju just throws its hands up and says something like “hook failed” and leaves the user with 0 options. We kill any sense of control and make users feel frustrated and stuck.

It’s a multi-faceted problem. Users think that Juju is misbehaving, but from Juju’s point of view, charms are misbehaving. Charm authors, in turn, usually depend on upstream layers, so to them it feels like other charms are misbehaving. The result is that it’s way too difficult for users to identify the actual issue and report a bug to the right place, let alone fix problems they encounter in production.

Here’s a snippet from one of Canonical’s internal IRC channels. I think it’s representative.

[11:33:47]<tartley> my first error from 'juju status' in staging, pre-deploy:
[11:33:50]<tartley> snapdevicegw-r050f610/1         error     idle   81       10.50.78.177                                                                               hook failed: "store-services-relation-changed"
[11:33:58]<tartley> No idea what I should do about this. Any clues?
[11:34:19]<roadmr> hm
[11:35:00]<tartley> doesn't go away when I 'juju status' again. :-)
[11:35:29]<roadmr> tartley: it has to do with verterok's change for the port thingy
[11:35:30]<roadmr> memcache_port = memcache.get_remote_all('port')[0]
[11:35:38]<roadmr> IndexError: list index out of range
[11:35:56]<verterok> tartley, roadmr back
[11:35:58]<verterok> looking
[11:36:25]<roadmr> thanks, I was about to go back to the MP to figure this out
[11:38:10]<verterok> the weird part, only one unit failed
[11:38:14]<verterok> thanks juju
[11:38:42]<verterok> tartley: taking a closer look, will free the env for you in a bit 
[11:40:03]<roadmr> juju 🤦  
[11:40:22]<tartley> verterok, okdokes, thanks for the assist, no hurry on my part.
[11:40:34]<verterok> will just destroy it and land some defensive code...
[11:40:58]<tartley> in fact, I'll step away a minute then, get kiddo dinner started, back soon...
[11:41:12]* tartley is now known as tartley|BRB
[11:41:20]<timclicks> roadmr: which charm/layer is this? is it a public memcache layer?
[11:42:47]<roadmr> timclicks: it's our (snapdevicegw) charm trying to get the port for those memcache units
[11:44:37]<verterok> timclicks: hi
[11:44:50]<verterok> timclicks: is using memcached layer
[11:44:59]<verterok> but is a pinned version
[11:45:01]* verterok checks
[11:45:24]<verterok> timclicks: the weird part is that 1/2 units got the port just fine
[11:45:47]<timclicks> :/
[11:46:01]<verterok> the charm is using memcache.get_remote_all('port')
[11:46:15]<verterok> as memcache_hosts() only return the IPs and not the ports
[11:46:56]<verterok> we are using memcache interface 5ccddd9 
[11:49:05]<timclicks> I wonder if the call could (optionally?) block until a port is available
[11:49:40]<verterok> actually b3245f167e9cfbe10de1f88553420347c588eec7
[11:49:48]<timclicks> if one of the units gets the right info, then it's likely the 2nd one will eventually right itself
[11:49:58]<timclicks> once the data propagates through the system
[11:50:04]<verterok> it could...but it's relying in the available state
[11:50:16]<verterok> :) 
[11:50:25]<timclicks> going back to tartley|BRB's question, perhaps waiting for a minute, then "juju resolve --all" 
[11:51:00]<verterok> tried that, still failed
[11:51:04]<timclicks> sign
[11:51:09]<timclicks> *sigh
[11:51:36]<verterok> the code is: https://git.launchpad.net/snapdevicegw/tree/charm/reactive/snapdevicegw.py#n99
[11:52:06]<timclicks> this is one of those cases where the juju team abstains responsibility because it's a charming bug, but the charm writers don't know what to do because juju makes state management too hard
[11:52:18]<verterok> tartley|BRB: staging is clear now
[11:52:49]<timclicks> meanwhile users just see a borked system 
[11:54:29]<timclicks> this definitely feels like a bug in the memcache layer. If memcache.available is True, then there should be a remote port to access
[11:54:55]<verterok> indeed
[11:55:08]<verterok> the interface is buggy
[11:55:10]<timclicks> that's outside of my area of expertise
[11:55:28]<verterok> https://github.com/omnivector-solutions/interface-memcache
[11:55:36]<verterok> I can probably fix it
[11:55:55]<verterok> instead of workaround it in our charm
[11:56:00]<timclicks> james beedy (bdx) on freenode, is a very active member of the community
[11:56:09]<timclicks> and he would accept patches very quickly
[11:56:19]<verterok> will send him a PR
[11:56:32]<verterok> thx for the pointer
[11:57:13]<timclicks> hopefully we can reduce the level of juju-facepalm in the future!
[11:57:20]<timclicks> (please don't make that an actual juju plugin)
[11:57:56]<verterok> roadmr: 2 ^
[11:58:01]<verterok> s/2//
[11:58:13]<verterok> your next plugin?
[11:58:47]* verterok EODs
[11:58:49]<roadmr>  I don't recall if emoju has a 🤦 command, checking
[11:59:16]<roadmr> no, no facepalm, i'll add it :)
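
For what it’s worth, the “defensive code” verterok mentions could look something like the sketch below. It is not the actual snapdevicegw fix: `first_remote_port` is a hypothetical helper, and the list it takes stands in for whatever `memcache.get_remote_all('port')` returned in the log above. The idea is just to treat “no port published yet” as a state to wait on rather than a crash.

```python
from typing import Optional, Sequence


def first_remote_port(ports: Sequence[Optional[str]]) -> Optional[int]:
    """Return the first usable port, or None if none has been published yet.

    `ports` stands in for the list returned by the relation lookup in the
    log above (an assumption for illustration, not the real charm code).
    """
    for raw in ports:
        # Skip empty or missing values instead of indexing blindly with [0].
        if raw and str(raw).isdigit():
            return int(raw)
    return None


if __name__ == "__main__":
    # An empty list is the case that raised IndexError in the log.
    port = first_remote_port([])
    if port is None:
        print("memcache port not available yet; waiting for relation data")
    else:
        print(f"memcache port: {port}")
```

A hook written this way can return early and try again when the relation data changes, instead of leaving the unit in “hook failed”.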

erik-lonroth · 5 November 2019 23:24

Yeah, figuring out problems with juju is close to debugging assembler. Insanely powerful, steep learning curve.

For one, perhaps make the debug-log of juju more similar to a “-v” “-vv” “-vvv” … “-vvvvvvvvv” like experience?

A “juju debug-error” plugin perhaps … or “juju debug-log -vvvv” ?
An easier filter usage, like: "juju debug-log " (instead of: juju debug-log --include ) ?
Have juju indicate where in the “state machine” something goes wrong (like clearly indicating in the log output which state juju is in). Could perhaps be part of the “juju status” output?
(Preventive measures) Have the charmstore be a bit more picky about some charms. For example, automatically rate charms based on a weighted score of: age, number of supported series, number of supported OSes, “low number of warnings of charm proof”, “existence of a good README.md”, “existence of a getstarted.md”, “existence of a non-default-icon”, “existence of a license/copyright”, “existence of a repo and bug site”, etc… Anything to allow the community to produce high-quality charms by gamifying the charming process in the direction of fame+glory+quality…

rick_h · 6 November 2019 13:58

Juju is interesting because there’s a large number of moving parts working together. There’s some cool stuff we could do to help make clear things like

“The charm says …”
“The Juju client says …”
“The underlying provider says …”
… and so on

Often it confuses folks and I think that it’s actually a hint that status could use a rethink to help clarify the status of “who”. A status line can mix the status of the unit agent, the charm messaging, and more.

simonrichardson · 6 November 2019 14:56

I’d rather have a human readable output format for juju, that could do better at explaining errors. At the moment a lot of the errors we spew out are a mix between human and machine readable. I don’t think it’s the best of both worlds.

This is a good source of inspiration (compile errors for humans), I’d even go further than “working with compiler error messages sucks” to “working with any error messages sucks”.

If we can provide better guidance then I’m sure half of the issues users are hitting would dissolve. The user thinks “OK, I’ve hit a problem, now what?”, and the error message could give better pointers.

simonrichardson · 7 November 2019 11:42

I thought it might be useful to give some insight into where I believe we could improve, with a small showcase. Previously, typing a command that was wrong resulted in a confusing and annoying message. I updated this to be more helpful:

juju controller 
ERROR juju: "controller" is not a juju command. See "juju --help".

Did you mean:
	controllers

As you can see it helps the user by giving a way forward rather than a dead end.
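
To illustrate the idea (this is not how Juju implements it; Juju is written in Go and this is a Python sketch), a “did you mean” suggestion can be built from fuzzy string matching. The command list is just sample data, and `suggest` and `unknown_command_message` are names made up for this example.

```python
import difflib
from typing import List

# A handful of Juju command names, used here only as sample data.
KNOWN_COMMANDS = ["controllers", "models", "status", "switch", "add-model", "bootstrap"]


def suggest(typed: str, known: List[str], limit: int = 3) -> List[str]:
    """Return known commands that look similar to what the user typed."""
    return difflib.get_close_matches(typed, known, n=limit, cutoff=0.6)


def unknown_command_message(typed: str, known: List[str]) -> str:
    """Build an error message that offers a way forward when one exists."""
    lines = [f'ERROR juju: "{typed}" is not a juju command. See "juju --help".']
    matches = suggest(typed, known)
    if matches:
        # Only add the hint when there is something useful to suggest.
        lines += ["", "Did you mean:"]
        lines += [f"\t{match}" for match in matches]
    return "\n".join(lines)


if __name__ == "__main__":
    print(unknown_command_message("controller", KNOWN_COMMANDS))
```

The cutoff of 0.6 is simply the `difflib` default; a real CLI would probably want to tune it so that unrelated commands are never suggested.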

A typical example of where we do provide dead ends is the following:

juju models test
ERROR unrecognized args: ["test"]

Essentially I missed -c to identify the controller, I think we could be a lot better here.

juju models test
ERROR juju: "test" is not a valid argument. See "juju models --help".

Did you mean:
        models -c test

Although a very contrived example, it does prove a point. If we want users (operators) to love juju we have to help them when learning or making mistakes. That way juju becomes less a ball of cryptic commands and becomes more human.

timClicks · 8 November 2019 01:19

One thing that I really love about Rust is that every error is coded. Those codes refer to a doc that explains situations where the error is encountered and often provides a suggestion or two about how to fix things.

I don’t know how practical it would be to retrofit some of that into Juju… but it might work well in the charms/hooks space.
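
A rough sketch of what coded errors with an “explain” lookup could look like, written in Python for brevity. Everything here is hypothetical: the code `J0001`, the wording of the explanation, and the function names are invented for illustration and do not exist in Juju.

```python
from typing import Dict, NamedTuple


class ErrorDoc(NamedTuple):
    summary: str
    explanation: str
    suggestion: str


# Hypothetical error codes and wording, invented for this sketch.
ERROR_DOCS: Dict[str, ErrorDoc] = {
    "J0001": ErrorDoc(
        summary="hook failed",
        explanation="A charm hook exited with a non-zero status.",
        suggestion="Inspect the unit's log output, fix the cause, then retry the hook.",
    ),
}


def format_error(code: str) -> str:
    """Short form, shown at the point of failure."""
    return f"ERROR [{code}] {ERROR_DOCS[code].summary}"


def explain(code: str) -> str:
    """Long form, shown when the user asks for more detail about a code."""
    doc = ERROR_DOCS.get(code)
    if doc is None:
        return f"no explanation available for {code}"
    return f"{code}: {doc.summary}\n\n{doc.explanation}\n\nSuggested fix: {doc.suggestion}"


if __name__ == "__main__":
    print(format_error("J0001"))
    print()
    print(explain("J0001"))
```

Keeping the short message and the long explanation in one registry means the two cannot drift apart, which is probably the main attraction for charm and hook errors.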

timClicks · 8 November 2019 01:22

My fav example of that is this sequence that was (is?) triggered before bootstrapping anywhere:

$ juju status
ERROR No selected controller.

Please use "juju switch" to select a controller.

$ juju switch
ERROR no currently specified model
$ juju add-model
ERROR model name is required
$ juju add-model testing
ERROR No selected controller.

Please use "juju switch" to select a controller.

rick_h · 8 November 2019 12:21

That feels like a bug to be filed, where switch would also detect the “no selected controller” situation and give you an error showing a list of controllers. Good example.

simonrichardson · 8 November 2019 12:28

I’ve made a start, cleaning up when we don’t supply a model name. Spending some dedicated time on this might be worthwhile in the future.

https://github.com/juju/juju/pull/10885

simonrichardson · 8 November 2019 14:30

This just landed in develop

timClicks · 27 November 2019 19:48

Any thoughts on this one?

$ juju relate hello-juju postgresql
ERROR ambiguous relation: "hello-juju postgresql" could refer to "hello-juju:db postgresql:db"; "hello-juju:db postgresql:db-admin"

Perhaps we could suggest referring to the documentation of the postgresql charm about the differences. If people are encountering this error, it’s likely that they don’t have a good understanding of what the relation and its interfaces/endpoints are.

zicklag · 27 November 2019 20:10

You could go super verbose like a Rust error message:

$ juju relate hello-juju postgresql
Error ambiguous relation:

Attempting to connect hello-juju:db to postgresql,
but there are 2 pgsql interfaces on the postgresql application:

- postgresql:db
- postgresql:db-admin

Please specify a specific connection to make. For example:

> juju relate hello-juju:db postgresql:db

It does seem like Rust’s error messages are the best example of a way to give deeper context to users that don’t have a deep understanding of what they are doing. rustc --explain is pretty useful.
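
As a sketch of how a message like that could be assembled from data the tool already has (the list of candidate endpoints), here is a small Python example. `ambiguous_relation_message` is a made-up helper, not a Juju function, and the wording follows the mock-up above.

```python
from typing import List


def ambiguous_relation_message(endpoint: str, application: str, candidates: List[str]) -> str:
    """Build a verbose error that lists the candidates and shows one concrete fix."""
    if not candidates:
        raise ValueError("an ambiguous relation needs at least one candidate endpoint")
    lines = [
        "ERROR ambiguous relation:",
        "",
        f"Attempting to connect {endpoint} to {application},",
        f"but there are {len(candidates)} matching endpoints on the {application} application:",
        "",
    ]
    lines += [f"- {candidate}" for candidate in candidates]
    lines += [
        "",
        "Please specify a specific connection to make. For example:",
        "",
        # Offer the first candidate as a copy-pasteable command.
        f"> juju relate {endpoint} {candidates[0]}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(ambiguous_relation_message(
        "hello-juju:db", "postgresql", ["postgresql:db", "postgresql:db-admin"]
    ))
```

Which candidate to show in the example command is a design choice; picking the first one is only a placeholder for something smarter, such as the endpoint the charm documents as its default.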

timClicks · 28 March 2020 21:03

I happened to read a blog post today about a new static analysis flag in GCC 10 and noticed that error tags in the output can be clicked on and then users can be sent to a web page for a detailed explanation.

It turns out that many terminal emulators support hypertext:

Most of the terminal emulators auto-detect when a URL appears onscreen and allow to conveniently open them (e.g. via Ctrl+click or Cmd+click, or the right click menu).

It was, however, not possible until now for arbitrary text to point to URLs, just as on webpages.

In spring 2017, GNOME Terminal and iTerm2 have changed this.

GNOME Terminal is based on the VTE widget, and almost all of this work went to VTE. As such, we expect other VTE-based terminal emulators to catch up and add support really soon. Other terminal emulators are also welcome and encouraged to join!

From https://gist.github.com/egmontkob/eb114294efbcd5adb1944c9f3cb5feda

Perhaps we could make use of this feature in our output?
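
For reference, the escape sequence described in that gist (OSC 8) is simple to emit. A minimal Python sketch follows; the URL is a placeholder, and `hyperlink` and `error_with_link` are names invented for this example.

```python
ESC = "\x1b"
ST = ESC + "\\"  # string terminator that ends an OSC sequence


def hyperlink(url: str, text: str) -> str:
    """Wrap `text` in an OSC 8 hyperlink pointing at `url`."""
    # Opening sequence carries the URL; the closing one has an empty URL.
    return f"{ESC}]8;;{url}{ST}{text}{ESC}]8;;{ST}"


def error_with_link(message: str, url: str, use_links: bool) -> str:
    """Make the ERROR tag clickable when the caller has decided links are OK."""
    label = hyperlink(url, "ERROR") if use_links else "ERROR"
    return f"{label} {message}"


if __name__ == "__main__":
    # Placeholder URL, not a real documentation page.
    print(error_with_link("hook failed", "https://example.com/errors/hook-failed", True))
```

Whether to emit the sequence at all would need some care, for example only when stdout is a terminal, since not every emulator supports it and piped output should stay plain.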

timClicks · 2 April 2020 19:44

Some feedback from @sssler-scania relating to this area:

Error messages that comes from juju are confusing for end-users and provides seldom any assistance to how to deal with them. Its a problem, since it affects the impression of juju in general. Presently, only introduce juju to hard-core devops that can manage this situation.

zicklag · 2 April 2020 22:02

I can sympathize with that. While Juju provides a way to make apps work together and automate themselves in a way that is absolutely amazing when it works, it quickly seems to be a hostile environment for debugging things when they fail. It can take a somewhat in-depth knowledge of Juju to fix things when they fail.

Charms are essentially a way to write an orchestrator for your app, and orchestrators are by nature difficult to understand and debug, so it’s not all Juju’s fault or anything like that. Any improvement we can find will help.

jugmac00 · 9 August 2023 13:26

zicklag:

You could go super verbose like a Rust error message:
$ juju relate hello-juju postgresql
Error ambiguous relation:

Attempting to connect hello-juju:db to postgresql,
but there are 2 pgsql interfaces on the postgresql application:

- postgresql:db
- postgresql:db-admin

Please specify a specific connection to make. For example:

> juju relate hello-juju:db postgresql:db
It does seem like Rust’s error messages are the best example of a way to give deeper context to users that don’t have a deep understanding of what they are doing. rustc --explain is pretty useful.

@zicklag This is not “super verbose” - this is how it ought to be.

I am a new user who stumbled upon the same problem, and luckily this discussion was the 2nd result in a google search.

I would love error messages which

clearly state why there is an error
give some background information about the error
offer a way for the user to recover (if possible)