[Discuss] automatic daemon restarts

Mon Sep 15 16:49:35 EDT 2014

On 9/15/2014 4:15 PM, Tom Metro wrote:
> Not to say your points are invalid, but Netflix would disagree with you.
> They created a testing tool that intentionally kills random services on
> their production systems just to test that automated recovery works
> correctly.

Netflix is a highly available application system that is designed to be
robust in the face of isolated faults and to degrade gracefully under
failure conditions. Chaos Monkey is the tool that they use to test the
implementations of their designs. It works by shutting down random
Netflix-owned instances within the AWS scalable architecture. Automated
recovery in the Netflix environment is simple: spin up a new instance
that is configured identically to the one that failed. They don't try to
restart the faulted instance. It's down for the count and it stays that
way so they can analyze the fault that knocked it out.

This is a /very/ different scenario from what you might have with a
single LAMP instance where systemd keeps restarting MySQL after a
persistent fault of some sort keeps knocking it out. This isn't
automated recovery; it's an automated disaster looking to wreck your tables.

-- 
Rich P.