<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="../assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Sean Gillies (Posts about reliability)</title><link>https://sgillies.net/</link><description></description><atom:link href="https://sgillies.net/tags/reliability.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><lastBuildDate>Sun, 31 Dec 2023 01:26:24 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Why do computers stop and what can be done about it?</title><link>https://sgillies.net/2019/01/29/why-do-computers-stop-and-what-can-be-done-about-it.html</link><dc:creator>Sean Gillies</dc:creator><description>&lt;p&gt;I have no formal training in computer science and will forever be catching up on
reading classic papers from the field.  Bill de hÓra posted a reading list on
his blog and I'm taking advantage of &lt;a class="reference external" href="https://dehora.net/journal/paper-reading"&gt;it&lt;/a&gt;. I've read a few of the papers
("The Law of Leaky Abstractions", "As We May Think") and
seen references to others, but some are completely new to me. &lt;a class="reference external" href="http://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf"&gt;"Why do
computers stop and what can be done about it?"&lt;/a&gt; caught my eye. What
a title. &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Jim_Gray_(computer_scientist)"&gt;Jim Gray&lt;/a&gt; nailed it.&lt;/p&gt;
&lt;p&gt;The paper outlines a playbook for highly available computer software systems in
general terms. Gray wrote it for all software engineers, not only for users of
a particular language or product. I didn't know that concepts like failing
fast, shared nothing, heisenbugs, and failover were established more than 30 years ago.
It was fascinating to read descriptions of them from the time when they were fresh.
Gray wrote this paper in a simple and straightforward style and I'm grateful
for that. I've read a number of AWS product sheets this week and they are
opaque in comparison. Do AWS Glue or Step Functions use strategies from Gray's
playbook?  Do they operate on different principles? It's difficult to tell.&lt;/p&gt;
&lt;p&gt;I remember well the search for Jim Gray when he went missing at sea in 2007.
I didn't plan to write a blog post about one of his papers on the 12th
anniversary of his disappearance, but that's what has happened. Read the paper
if you haven't. I can almost guarantee that you'll find at least a few interesting
insights and rules of thumb.&lt;/p&gt;</description><category>availability</category><category>computing</category><category>reading</category><category>reliability</category><category>transactions</category><category>work</category><guid>https://sgillies.net/2019/01/29/why-do-computers-stop-and-what-can-be-done-about-it.html</guid><pubDate>Tue, 29 Jan 2019 02:44:16 GMT</pubDate></item></channel></rss>