For reasons that are still unknown, the CPU load on the central backend went to 100% at 16:10 UTC. As a result the web server/API became unresponsive. Pinging the server was still possible, but SSH access failed due to the CPU load. The cloud provider's direct console access did not work either, as the terminal was completely unresponsive. So far a kernel problem or hardware issue seems the most likely cause. The access log looked completely normal up to that point, so the issue apparently wasn’t triggered from outside.
The box was therefore shut down forcefully. This took about 2 minutes, as the system first tried a graceful shutdown, which didn’t work. We were about to announce a scheduled downtime in the next 2-3 weeks for hardware resizing and took the chance to do the resize now. This took about 1 minute extra. After that the server was powered back on and came up without problems. The web server process was shut down temporarily to prevent user/API/device access. The primary database was then compared with the backup system to see if there was any data loss due to the hard shutdown. No data was lost, so after a few minutes the web server was switched back on. The devices began reconnecting right away, gradually over the following minutes.
(Intermission) Devices have a permanent websocket connection to the backend service. If this connection gets interrupted (as in this case), the devices wait for a random time (up to 60 seconds) before they try to reconnect. If this fails, they increase the wait time between further connection attempts, up to a maximum of 6 minutes between attempts. So after the service was back online it took up to 6 minutes before all devices had reconnected.
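For anyone curious what such a reconnect schedule can look like, here is a minimal Python sketch. It is not the actual device firmware. The 60 second and 6 minute limits come from the description above; the doubling factor, the random wait on every attempt, and the name `reconnect_delays` are assumptions made only for illustration.

```python
import random

INITIAL_MAX_WAIT_S = 60   # from the post: first wait is random, up to 60 seconds
MAX_WAIT_S = 6 * 60       # from the post: at most 6 minutes between attempts


def reconnect_delays(attempts, growth=2.0, rng=random):
    """Yield the wait in seconds before each reconnect attempt.

    Assumptions (not stated in the post): the upper bound doubles after
    every failed attempt, and each wait is drawn at random below that bound.
    """
    ceiling = INITIAL_MAX_WAIT_S
    for _ in range(attempts):
        # Random wait so that devices do not all reconnect at the same moment.
        yield rng.uniform(0, ceiling)
        # Raise the bound for the next attempt, capped at 6 minutes.
        ceiling = min(ceiling * growth, MAX_WAIT_S)


if __name__ == "__main__":
    for n, delay in enumerate(reconnect_delays(5), start=1):
        print(f"attempt {n}: wait {delay:.1f} s")
```

The random component spreads the reconnects out over time, which is why the devices come back gradually rather than all at once when the service returns.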
I’m not entirely sure if the root cause for the outage can be discovered at this point. There was no kernel error message in the last log lines and everything looked perfectly normal until the server suddenly became unresponsive. So it seems there is not a lot to learn from this to prevent this kind of issue in the future.
On the positive side, the new bigger box will make things a bit smoother in the future.
Sorry for the downtime. Feel free to post here if you have any questions.