Why do computers stop and what can be done about it?

I have no formal training in computer science and will forever be catching up on classic papers from the field. Bill de hÓra posted a reading list on his blog and I'm taking advantage of it. I've read a few of the papers ("The Law of Leaky Abstractions", "As We May Think") and seen references to others, but some are completely new to me. "Why do computers stop and what can be done about it?" caught my eye. What a title. Jim Gray nailed it.

The paper outlines, in general terms, a playbook for highly available software systems. Gray wrote it for all software engineers, not only for users of a particular language or product. I didn't know that concepts like failing fast, shared nothing, heisenbugs, and failover were established more than 30 years ago. It was fascinating to read descriptions of them from the time when they were still fresh. Gray wrote this paper in a simple and straightforward style and I'm grateful for that. I've read a number of AWS product sheets this week and they are opaque in comparison. Do AWS Glue or Step Functions use strategies from Gray's playbook? Do they operate on different principles? It's difficult to tell.

I remember well the search for Jim Gray when he went missing at sea in 2007. I didn't plan to write a blog post about one of his papers on the 12th anniversary of his disappearance, but that's what has happened. Read the paper if you haven't. I almost guarantee that you'll find at least a few interesting insights and rules of thumb.