Monitoring for Operational Outcomes
Sansa Bailish is focused on outages, reliability, and getting better sleep. There is a known issue with image trends: when the instances are rebooted, the application doesn't restart. Too frequently, Sansa has received a late-night call to get online and restart the application. The user experience lead is very frustrated with the downtime associated with these incidents. The incidents need to be detected sooner and resolved faster.
Sansa is looking for a monitoring solution to detect the incident (the reboot event), and a way to trigger an automated recovery.
Create a Runbook to Resolve the Error
6.1 Resolve the error manually
6.2 Automate Responses using Lambda
6.3 Validate - Ensure the response has the desired outcome
6.4 Bonus Content: Review the outcomes in CloudWatch
What have we achieved?
- We have answered three monitoring needs, one for each of our teams.
- We have provided insight into the quality of the end user experience by creating an image tag confidence metric for the business team.
- We have provided insight into issues and potential development needs by providing the development team (DevOps) with error and warning metrics.
- We have provided a mechanism for the operations team (SRE) to use an event trigger to automatically remediate an outage-causing issue in the environment.
How did we create an automatic remediation?
- Knowing the log entry that indicates the application has failed and is in a state from which it can be recovered, we created a metric.
- Using that metric, we created an alarm that notifies via an SNS topic.
- We have a Lambda function subscribed to that SNS topic that creates a CloudWatch Event in response.
- We created a CloudWatch Events rule that triggers on our Lambda-initiated CloudWatch Event and invokes Run Command.
- Our Run Command invocation uses a command document we created to execute the startup script on our server, restoring it to an operating state.
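The Lambda step in this chain can be sketched roughly as follows. This is a minimal illustration, not the workshop's actual function: the `EVENT_SOURCE` and `DETAIL_TYPE` values are hypothetical, the choice to publish only when the alarm is in the `ALARM` state is an assumption, and the optional `events_client` parameter exists only so the handler can be exercised without AWS credentials. It relies on the standard shape of an SNS-delivered CloudWatch alarm notification (`Records[].Sns.Message` containing JSON with `AlarmName`, `NewStateValue`, and `NewStateReason`) and on the boto3 `put_events` call.

```python
import json

# Hypothetical identifiers for this sketch; the workshop's own values may differ.
EVENT_SOURCE = "custom.workshop.remediation"
DETAIL_TYPE = "Application Restart Requested"


def build_event_entry(sns_record):
    """Convert one SNS record carrying a CloudWatch alarm notification into a
    PutEvents entry. Returns None when the alarm is not in the ALARM state."""
    alarm = json.loads(sns_record["Sns"]["Message"])
    if alarm.get("NewStateValue") != "ALARM":
        return None  # Assumption: only a firing alarm should trigger recovery.
    return {
        "Source": EVENT_SOURCE,
        "DetailType": DETAIL_TYPE,
        "Detail": json.dumps(
            {
                "alarmName": alarm.get("AlarmName"),
                "reason": alarm.get("NewStateReason"),
            }
        ),
    }


def lambda_handler(event, context, events_client=None):
    """Publish a custom event for each alarm notification delivered by SNS."""
    if events_client is None:
        # Imported lazily so the module can be loaded without boto3 installed.
        import boto3

        events_client = boto3.client("events")

    entries = []
    for record in event.get("Records", []):
        entry = build_event_entry(record)
        if entry is not None:
            entries.append(entry)

    if not entries:
        return {"published": 0}

    response = events_client.put_events(Entries=entries)
    return {"published": len(entries) - response.get("FailedEntryCount", 0)}
```

Under these assumptions, a rule whose event pattern matches on the same `source` and `detail-type` values would pick up the published event and hand it to its Run Command target.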
We have created something of a Rube Goldberg machine to achieve that outcome. In doing so, we have demonstrated how to get more value out of logs, how to use events, and how to trigger actions in response to events, all enabled by CloudWatch.