Excessive growth of delta


#1

Hey guys, we’re hitting some python-libjuju/juju issues when testing our automated OpenStack charms deployment and upgrade.

Python-libjuju receives deltas from the controller to keep up with the actual state of the model. The first one is received during connection. Since deltas are sent the same way as all other commands - as RPC messages - they fall under the RPC message size limit.

The problem appears with heavy use of juju actions, especially when the output of those actions is long and the number of targeted units is large (e.g. an OpenStack deploy). The output of every action, for all of these units, is stored in this delta and sent with every delta. Once the delta outgrows the limit, libjuju fails to connect, or, if already connected, continues with an inconsistent state because attempts to receive the delta keep failing.

In our case we are using a 64 MB limit (compared to the default 4 MB) and we can still hit it within a week if we start with a clean model. We suggest that at least action output be lazy-loaded on demand, to preserve the reliability of the client. This may require changes in both python-libjuju and juju itself.
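To illustrate why raising the frame limit only buys time, here is a back-of-envelope estimate of delta growth. The workload numbers are illustrative assumptions (not measurements from our deployment), chosen to show how a busy model can plausibly reach a 64 MB limit in about a week:

```python
# Rough estimate of how fast accumulated action output can exceed the
# RPC frame limit. All workload numbers are illustrative assumptions.

FRAME_LIMIT = 64 * 1024 ** 2       # the raised limit: 64 MiB

units = 100                        # assumed units in the model
actions_per_unit_per_day = 24      # assumed automation cadence
avg_output_bytes = 4 * 1024        # assumed average action output size

# Every action result stays in the delta, so growth is cumulative.
growth_per_day = units * actions_per_unit_per_day * avg_output_bytes
days_to_limit = FRAME_LIMIT / growth_per_day

print(f"delta grows ~{growth_per_day / 1024 ** 2:.1f} MiB/day")
print(f"limit reached after ~{days_to_limit:.1f} days")
```

Under these assumptions the delta grows by roughly 9 MiB per day and crosses the 64 MiB limit in under a week, which matches what we observe; lazy-loading action output would remove this cumulative term entirely.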

One of the errors we observed when hitting the issue during model cleanup was already reported, but has not yet been followed up on: https://github.com/juju/python-libjuju/issues/179

We have noticed this idea of using lazy loading; having more fine-grained watchers could perhaps help with this as well: https://github.com/juju/python-libjuju/issues/181


#2

Is this failing while reading the contents of the AllWatcher? It does seem reasonable to include only the fact that actions exist in the stream, rather than the full content of the action result. That said, if you’re explicitly asking about actions, then it makes more sense to have that available.

I know we also talked about possibly having actions return a “file”, which would be a better way to handle large responses. Those could certainly be exempted by default and retrieved via a more appropriate (chunked/streamed) API.


#3

I should also mention that in recent releases (I believe the 2.3.* series), @externalreality implemented support for pruning old actions. So if specific action results are the problem, you should be able to remove them.
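For reference, pruning is driven by model configuration. A sketch of what that looks like, assuming the `max-action-results-age` and `max-action-results-size` keys (check `juju model-config` on your release for the exact key names and defaults):

```shell
# Prune action results older than a day, and cap their total size.
# Key names are from memory; verify them against your Juju release.
juju model-config max-action-results-age=24h
juju model-config max-action-results-size=512M

# Inspect the current values.
juju model-config | grep max-action-results
```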


#4

Yes, it is the AllWatcher request that is too big. It is easiest to catch when connecting: as far as I understand, it happens in the websockets module, where the size limit is set and detected. It throws websockets.ConnectionClosed, which is then handled in connection.py with the message `RPC: Connection closed, reconnecting` and re-attempted 3 times. I have more difficulty tracing it when it happens during an active connection.


#5

Pruning seems more like a hotfix than a solution, at least in our case, but thanks for mentioning it; we will try adjusting it.