Cloud Troubleshooting

by Douglas Bernardini

Those diagnosing a technical problem with cloud infrastructure are seeking possible explanations (hypotheses) and evidence that explains the problem. In the short term, they look for changes in the system that roughly correlate with the problem, and consider rolling back, as a first step to mitigate the problem and stop the bleeding. The longer-term goal is to identify and fix the root cause so the problem will not recur.

From the site reliability engineering (SRE) perspective, the general approach for troubleshooting is as follows:

  • Triage: Mitigate the impact if possible
  • Examine: Gather observations and share them
  • Diagnose: Create a hypothesis that explains the observations
  • Test and treat: Identify tests that may prove or disprove the hypothesis
  • Execute the tests and agree on the meaning of the result
  • Move on to the next hypothesis; repeat until solved

When you’re working with a cloud provider on troubleshooting an issue, there are parts of the process you’re unable to control. But you can follow the steps on your end. Here’s what you can do when submitting a report to your cloud provider support team.

1. Communicate any troubleshooting you’ve already done

By the time you open an issue report, you’ve probably done some troubleshooting already. You may have checked the provider’s status page, for example. Share the steps you’ve taken and any key findings. Keep a timeline and log book of what you have done, starting as soon as you detect the problem, and share it with the provider. Keep in mind that while cloud providers may have telemetry that gives them a real-time view of the state of their own infrastructure, the dependencies that result from your particular implementation may be less obvious to them. By design, your particular use of cloud resources is proprietary and private, so your troubleshooting vantage point is vital.
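A log book can be as simple as a list of timestamped entries. The following is a minimal sketch in Python, not a prescribed tool: the `LogBook` and `LogEntry` names and the output layout are illustrative assumptions. It records each action with a UTC timestamp so the timeline can be pasted into an issue report.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LogEntry:
    timestamp: str  # ISO 8601 timestamp in UTC
    action: str     # what you did or observed
    finding: str    # what you learned, if anything


@dataclass
class LogBook:
    """Hypothetical in-memory log book for one incident."""
    entries: list = field(default_factory=list)

    def record(self, action, finding="", when=None):
        # Default to "now" in UTC so entries from different people line up.
        when = when or datetime.now(timezone.utc)
        entry = LogEntry(when.isoformat(timespec="seconds"), action, finding)
        self.entries.append(entry)
        return entry

    def to_text(self):
        # One line per entry, oldest first, ready to paste into a report.
        lines = []
        for e in self.entries:
            suffix = f" -> {e.finding}" if e.finding else ""
            lines.append(f"{e.timestamp}  {e.action}{suffix}")
        return "\n".join(lines)
```

Calling `record()` each time you try something, and attaching the output of `to_text()` to the report, gives the provider the same timeline you are working from.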

If you think you have a diagnosis, explain how you came to that conclusion. If you think others can reproduce the issue, include the steps to do so. A reproducible test in an issue report usually leads to the fastest resolution.

You may have an idea or guess about what’s causing the problem. Be careful to avoid confirmation bias: looking for evidence that supports your guess while ignoring evidence to the contrary.

2. Be specific and explicit about the issue

If you’ve ever played the telephone game, in which players whisper a message from person to person, you’ve seen how human translation and interpretation can lead to communication gaps. Rather than describing information in your communications with the provider, share the information itself. Doing so reduces the chance that your reader will misinterpret what you’re saying and can help speed up troubleshooting. Don’t assume that your provider has access to all of this information; customer privacy means that they may not, by design.

For example:

  • Use a screenshot to show exactly what you see
  • For web-based interfaces, provide a HAR (HTTP Archive) file
  • Attach information like tcpdump output, log snippets and example stack traces
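Before attaching a HAR file, it can help to point the reader at the requests that failed. The sketch below is one possible approach, assuming the standard HAR JSON layout (`log.entries[]` with `request` and `response` objects); the `failed_requests` helper is a hypothetical name, and treating status 0 as a failure is an assumption about how browsers record requests that never completed. Note that HAR files can contain cookies and authorization headers, so review them before sharing.

```python
import json


def failed_requests(har_text):
    """Return a short summary of the failed requests in a HAR export."""
    har = json.loads(har_text)
    failures = []
    for entry in har["log"]["entries"]:
        status = entry["response"]["status"]
        # Assumption: status 0 means the request did not complete;
        # 4xx and 5xx are client and server errors.
        if status == 0 or status >= 400:
            failures.append({
                "started": entry["startedDateTime"],
                "method": entry["request"]["method"],
                "url": entry["request"]["url"],
                "status": status,
            })
    return failures
```

Including this short list in the body of the report, alongside the full file, lets support see the failing calls without opening the archive first.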

3. Report production outages quickly

An issue is considered to be a production outage if your application has stopped serving traffic to users or is experiencing similar business-critical impact. Report production outages to your cloud provider support as soon as possible. Issues that block a small number of developers in a developer test environment are normally not considered production outages, so they should be reported at lower priorities.

Normally, when cloud provider support is alerted about a production outage, they quickly triage the situation with the following steps:

  • Immediately check for known issues affecting the infrastructure.
  • Confirm the nature of the issue.
  • Establish communication channels.

Typically, you can expect a quick response with a brief message, which might contain:

  • Whether or not there is a known issue affecting multiple customers
  • An acknowledgement that they can observe the issue you’ve reported, or a request for more details
  • How they intend to communicate (for example, phone, Skype, or issue report)

It’s important to quickly create an issue report including the four critical details (described in part one) and then begin deeper troubleshooting on your side of the equation. If your organization has a defined incident management process (see Managing Incidents), escalating to your cloud provider should be among your initial steps.

4. Report networking issues with specificity

Most cloud providers’ networks are huge and complex, composed of many technologies and teams. It’s important to quickly identify a networking-specific problem as such and engage with the team that can repair it.

At a high level, many networking issues share the same symptoms, such as “can’t connect to server.” This level of detail is typically too generic to be useful in identifying the root cause, so you need to provide more diagnostic information. Network issues relate to connectivity, which always involves at least two specific points: source and destination. Always include information about these points when reporting network issues.
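One way to capture both endpoints is to run a small connectivity probe from the affected machine and paste its output into the report. The following is a rough sketch using only Python's standard `socket` module; the `probe_tcp` function and the field names in its result are illustrative assumptions, and it covers only TCP reachability, not routing or DNS problems.

```python
import socket
import time
from datetime import datetime, timezone


def probe_tcp(host, port, timeout=3.0):
    """Attempt one TCP connection and describe source and destination."""
    report = {
        "checked_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "source_hostname": socket.gethostname(),
        "destination_host": host,
        "destination_port": port,
    }
    started = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # The local and remote addresses actually used by this connection.
            report["source_address"] = sock.getsockname()[0]
            report["source_port"] = sock.getsockname()[1]
            report["destination_address"] = sock.getpeername()[0]
            report["result"] = "connected"
    except OSError as exc:
        # Covers refused connections, timeouts, and name resolution failures.
        report["result"] = f"failed: {type(exc).__name__}: {exc}"
    report["elapsed_ms"] = round((time.monotonic() - started) * 1000, 1)
    return report
```

Running the probe from the source that shows the symptom, and again from a source that works, gives support a concrete pair of observations to compare.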

5. Escalate when appropriate

If circumstances change, you may need to escalate the urgency of an issue so it receives attention quickly. Take this step if business impact increases, if an issue is stuck without progress after a lot of back-and-forth with support, or if some other factor calls for quicker resolution.

The most explicit way to escalate an issue is to change the priority of the issue report (for example, from P3 to P2). Provide comments about why you need to escalate so support can respond appropriately.

6. Create a summary document for long-running or difficult issues

Issue state and relevant information change over time as new facts come to light and hypotheses are ruled out. Meanwhile, new people may join the investigation. Keep everyone current by collecting the relevant, up-to-date information in a summary document.
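The shape of such a document is up to you. As a loose sketch, assuming a summary that tracks impact, current state, and the status of each hypothesis (the `IssueSummary` and `Hypothesis` names and fields are hypothetical, not a provider format), it could be kept as structured data and rendered to plain text whenever it changes:

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    statement: str
    status: str = "open"  # "open", "ruled out", or "confirmed"
    evidence: str = ""    # the observation that supports the status


@dataclass
class IssueSummary:
    title: str
    impact: str
    current_state: str
    hypotheses: list = field(default_factory=list)

    def render(self):
        # Plain text so it can be pasted into any issue tracker.
        lines = [
            f"Issue: {self.title}",
            f"Impact: {self.impact}",
            f"Current state: {self.current_state}",
            "Hypotheses:",
        ]
        for h in self.hypotheses:
            suffix = f" ({h.evidence})" if h.evidence else ""
            lines.append(f"  [{h.status}] {h.statement}{suffix}")
        return "\n".join(lines)
```

Keeping ruled-out hypotheses in the document, with the evidence that ruled them out, saves newcomers from repeating tests that have already been run.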