The most important factor in maintaining the reliability of a cloud service is monitoring. Code quality, redundancy, and other factors are also important, but a good monitoring system will prevent some outages and minimize the impact of others. This is a guiding principle behind my work on Unread Cloud, and on the Webpage Text API that powers Unread’s webpage text capabilities.

There are two parts to monitoring: verifying that systems are running as expected, and generating appropriate notifications when something is amiss.

Verification

I have a Monitor application with a process that does an exhaustive health check on my Authentication system, on Unread Cloud, and on the Webpage Text API. Checks include:

  • That disk space, memory usage, and load on all instances are within acceptable thresholds.
  • That no TLS/SSL certificate is nearing expiration.
  • That all DNS records are in place and configured as expected.
  • That database backups are happening as expected.
  • That all instances have the correct date and time.
  • That automated security updates have been applied as expected.
  • That retrieving webpage text for a not-yet-known webpage URL works as expected the first time, and that the text is returned from cache when requested again.
  • That creating a new Unread Cloud account, subscribing to a feed with a URL not yet seen by Unread Cloud, getting new articles for that feed, saving an article, marking an article as read, and then deleting that account works as expected.
  • That all Unread Cloud feeds have been polled as recently as expected.
  • That any failures to poll feeds appear to be a problem with the feed publisher, not the result of an issue with Unread Cloud.
  • That the percentage of feeds where polling fails is within certain thresholds. There will be some feeds that are inaccessible because of issues with or changes to websites publishing those feeds, but if polling is failing for most feeds, then something else is wrong.
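
A few of these checks might be sketched in Python as follows. This is a minimal illustration, not Monitor's actual code: the function names, the Severity levels, and every threshold (90% disk usage, 14 days before certificate expiry, 10% and 50% feed failure rates) are assumptions chosen for the example.

```python
import shutil
import ssl
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    OK = "ok"
    WARNING = "warning"   # investigate within a day or two
    URGENT = "urgent"     # needs immediate attention


@dataclass
class CheckResult:
    name: str
    severity: Severity
    detail: str


def check_disk_space(path: str = "/", max_used_fraction: float = 0.90) -> CheckResult:
    """Flag the volume containing `path` when it is fuller than the threshold."""
    usage = shutil.disk_usage(path)
    used = usage.used / usage.total if usage.total else 0.0
    severity = Severity.URGENT if used > max_used_fraction else Severity.OK
    return CheckResult("disk space", severity, f"{used:.0%} of {path} in use")


def check_certificate_expiry(not_after: str, now: float, warn_days: int = 14) -> CheckResult:
    """`not_after` is a certificate's notAfter string, e.g. 'Jan  5 09:34:43 2018 GMT'.

    `now` is a Unix timestamp, passed in so the check is easy to test.
    """
    days_left = (ssl.cert_time_to_seconds(not_after) - now) / 86400
    if days_left < 0:
        severity = Severity.URGENT      # already expired
    elif days_left < warn_days:
        severity = Severity.WARNING     # nearing expiration
    else:
        severity = Severity.OK
    return CheckResult("certificate", severity, f"{days_left:.1f} days until expiry")


def check_feed_failure_rate(failed: int, total: int,
                            warn_at: float = 0.10, urgent_at: float = 0.50) -> CheckResult:
    """Some failures are normal; a high failure rate points at the service itself."""
    rate = failed / total if total else 0.0
    if rate >= urgent_at:
        severity = Severity.URGENT
    elif rate >= warn_at:
        severity = Severity.WARNING
    else:
        severity = Severity.OK
    return CheckResult("feed polling", severity, f"{rate:.0%} of feeds failing")
```

Each check returns a result rather than raising, so one failing check does not stop the rest from running, and the severity can later decide which kind of notification is appropriate.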

The Monitor application also has a web component with two API endpoints: one indicating whether there is an urgent problem demanding my immediate attention, and another indicating whether there is an issue that should be investigated within the next day or two.
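
One way to expose two such endpoints is sketched below using Python's standard library. The paths (/status/urgent and /status/non-urgent), the in-memory CURRENT_ISSUES dictionary, and the convention of answering 200 when all is well and 503 when there is a problem are assumptions made for this example; the post does not say how Monitor's endpoints signal a problem.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict, List, Tuple

# Hypothetical shared state; a real monitor would refresh this after each round of checks.
CURRENT_ISSUES: Dict[str, List[str]] = {"urgent": [], "non_urgent": []}

# Hypothetical endpoint paths, one per level of urgency.
ROUTES = {"/status/urgent": "urgent", "/status/non-urgent": "non_urgent"}


def respond(path: str, issues: Dict[str, List[str]]) -> Tuple[int, str]:
    """Map a request path to an HTTP status code and a plain-text body."""
    key = ROUTES.get(path)
    if key is None:
        return 404, "unknown endpoint\n"
    problems = issues[key]
    if problems:
        # A non-200 status lets any uptime checker treat this as a failure.
        return 503, "\n".join(problems) + "\n"
    return 200, "OK\n"


class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        code, body = respond(self.path, CURRENT_ISSUES)
        payload = body.encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8000) -> None:
    """Run the status server until interrupted."""
    HTTPServer(("127.0.0.1", port), StatusHandler).serve_forever()
```

Calling serve() starts the server. Keeping the endpoints this simple means the notification side only needs to understand HTTP status codes.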

Notification

I use Pingdom for notification of server issues. If Monitor is reporting an urgent issue, I get a text message. If Monitor is reporting an issue that should be investigated within the next day or two, I get an email.

Since Monitor does the work of detecting problems, I can easily switch to a different service for notifications if necessary.
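
To illustrate how little the notification side has to do, here is a rough sketch of a stand-in poller in Python. It is not how Pingdom works internally. It assumes the status endpoints answer 200 when all is well and something else otherwise, and the "sms" and "email" channel names are placeholders for whatever delivery mechanism is used.

```python
import urllib.error
import urllib.request
from typing import Optional


def fetch_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the endpoint's HTTP status code, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code   # e.g. 503 when a problem is being reported
    except (urllib.error.URLError, OSError):
        return None         # the monitor itself is down or unreachable


def choose_channel(urgent_status: Optional[int],
                   non_urgent_status: Optional[int]) -> Optional[str]:
    """Decide how to notify: text message for urgent issues, email otherwise."""
    if urgent_status != 200:
        return "sms"        # includes the case where the monitor is unreachable
    if non_urgent_status != 200:
        return "email"
    return None             # nothing to report
```

Treating an unreachable endpoint as urgent is a deliberate choice in this sketch: if the monitor itself is down, nothing else would raise the alarm.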

I have been using this system to monitor the Webpage Text API since before releasing Unread 2 in February 2020. I added verification of Unread Cloud functionality in late 2021. This system is serving me well, and helps ensure that any issues come to my attention right away.

If you have not yet tried Unread 3, you can download it from the App Store.