Update 1
I will fill in the details here later, but when I woke up, PianoTell was not responding. Did anyone else notice?
Update 2
On 22 February 2026, we had our first unplanned outage on PianoTell since 6 February 2025.
status.pianotell.com has no record of the incident because monitoring appears to have been paused.
BetterStack apparently sent me an email on 26 May 2025 saying that they had automatically paused monitoring because I hadn't signed in for 3 months. I've unpaused monitoring, but a monitor that silently stops monitoring is useless to me, so I will look at something like srvup.io instead.
Based on local server logs, I estimate we were down from 02:15 to 09:15 Pacific Time. At about 7 hours, that is the longest PianoTell outage ever by far.
The other piece of bad news is that this is a repeat of a previous incident. You never want a new incident that is an exact repeat of a previous incident. The purpose of a retrospective is to identify repair actions and implement them so that such an incident never occurs again.
In the previous incident:
We apparently ran out of disk space. PianoTell creates backups every 15 minutes and uploads them to the cloud. Unfortunately, I never implemented logic to delete unneeded backups on the local disk.
In my defense, I did eventually get around to implementing a cleanup job so that no more than 5 days' worth of backups are kept locally. In theory, this ensures that PianoTell has plenty of disk space, which should have prevented a similar outage from recurring.
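For illustration, an age-based local cleanup rule like the one described above can be sketched as follows. This is a minimal sketch, not PianoTell's actual job: the function name `backups_to_delete` and the idea of representing each backup by its creation timestamp are assumptions made for the example.

```python
from datetime import datetime, timedelta


def backups_to_delete(backup_times, now, keep_days=5):
    """Return the backup timestamps that are older than keep_days.

    backup_times: iterable of datetime objects, one per local backup
                  (hypothetical representation, for illustration only).
    now:          the current time, passed in so the rule is easy to test.
    """
    cutoff = now - timedelta(days=keep_days)
    # Anything created strictly before the cutoff is eligible for deletion.
    return [t for t in backup_times if t < cutoff]
```

A real job would map these timestamps back to files and remove them; passing `now` in as a parameter keeps the rule itself free of side effects.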
What happened this time is that cloud storage ran out of space, and of course, I don't have an automated cloud storage cleanup job. The sequence of failures was like this:
- Cloud storage ran out of space.
- PianoTell kept running local backups but failed to upload the backups to the cloud.
- Because the backups failed to upload, the cleanup step never ran locally.
- Even though I cleared cloud storage manually, it didn't matter because PianoTell had a massive backlog of backups to push to the cloud and immediately filled it again.
- PianoTell eventually ran out of local storage as well.
- Boom.
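The chain above comes down to local cleanup being gated on a successful upload. The toy simulation below contrasts that coupling with a decoupled variant. It is a sketch under assumptions, not the real backup code: `coupled_cycle`, `decoupled_cycle`, the count-based `keep` limit, and the use of `OSError` to signal a full cloud bucket are all hypothetical.

```python
def coupled_cycle(local_backups, new_backup, upload, keep):
    """Cleanup only runs when the upload succeeds (the failure mode above)."""
    local_backups.append(new_backup)
    upload(new_backup)           # raises OSError when cloud storage is full
    del local_backups[:-keep]    # never reached if the upload fails


def decoupled_cycle(local_backups, new_backup, upload, keep):
    """Local pruning always runs, whether or not the upload succeeded.

    Assumes keep >= 1.
    """
    local_backups.append(new_backup)
    try:
        upload(new_backup)
    except OSError:
        pass  # a real job would log the failure and raise an alert here
    # Keep only the newest `keep` backups on local disk.
    del local_backups[:-keep]


def full_cloud(name):
    """Stand-in uploader that always fails, as if the cloud were full."""
    raise OSError("cloud storage is full")
```

With `full_cloud` as the uploader, repeated calls to `coupled_cycle` grow the local list without bound, while `decoupled_cycle` holds it at `keep` entries. The tradeoff is that the decoupled variant may prune backups that were never uploaded, so it protects the server's disk at the cost of some backup history during a cloud outage.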
To fix this properly I need to implement a job that will clean up cloud storage, but I intentionally want to keep more backups on the cloud than locally on PianoTell. I have some concrete ideas in mind here to improve the robustness of our backup system and will look into prioritizing this.
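One possible shape for a cloud cleanup rule that keeps more history than the local disk is a thinning policy: keep every recent backup, then only the newest backup per day for an older window. This is purely a sketch of one option, not the plan itself; the function name and the `full_days` and `daily_days` values are made-up assumptions.

```python
from datetime import datetime, timedelta


def cloud_backups_to_keep(backup_times, now, full_days=5, daily_days=90):
    """Return the sorted backup timestamps a thinning policy would keep.

    - Backups newer than full_days are all kept.
    - Between full_days and daily_days, only the newest backup of each
      calendar day is kept.
    - Anything older than daily_days is dropped.
    The window sizes are hypothetical defaults.
    """
    full_cutoff = now - timedelta(days=full_days)
    daily_cutoff = now - timedelta(days=daily_days)
    keep = set()
    newest_per_day = {}
    for t in backup_times:
        if t >= full_cutoff:
            keep.add(t)
        elif t >= daily_cutoff:
            day = t.date()
            if day not in newest_per_day or t > newest_per_day[day]:
                newest_per_day[day] = t
    keep.update(newest_per_day.values())
    return sorted(keep)
```

Everything not returned would be a candidate for deletion from cloud storage. With a backup every 15 minutes, a rule like this bounds cloud usage at roughly 96 backups per day for the recent window plus one per day for the older window.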
To summarize:
- Monitoring failed.
- Longest outage ever for PianoTell.
- Repeat of previous incident.
- Ouch.
Sorry for the downtime!