Lessons from the Cloud: Analyzing Google's Recent Outage and Its Implications

About this listen

Hello and welcome to the Cloud Minute. Last week, Google Cloud suffered a three-hour outage that left customers unable to access their rented infrastructure. At the heart of the problem was a Service Control update rolled out on May 29 without a feature flag or proper error handling. When a policy change on June 12 introduced “unintended blank fields,” a “null pointer caused the binary to crash,” triggering a global crash loop.
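The failure described above (a new code path shipped without a feature flag, then crashing on a blank field) can be illustrated with a minimal sketch. This is not Google's code: `Policy`, `check_quota`, and `new_path_enabled` are hypothetical names, and letting the request through when a field is blank is an assumption made for the example; a real system might reject the request instead.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Policy:
    # Hypothetical policy record; either field may arrive blank (None).
    name: Optional[str]
    quota_limit: Optional[int]


def check_quota(policy: Optional[Policy], usage: int, new_path_enabled: bool) -> bool:
    """Return True if the request should be allowed."""
    if not new_path_enabled:
        # Feature flag off: the new check never runs, so the code can be
        # rolled out gradually and switched off quickly if it misbehaves.
        return True
    if policy is None or policy.quota_limit is None:
        # Blank field: handle it explicitly instead of dereferencing a
        # missing value and crashing (assumption: allow the request).
        return True
    return usage <= policy.quota_limit
```

With the flag on, `check_quota(Policy("p", None), 5, True)` returns `True` rather than raising, so a single malformed policy cannot take the whole process down.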

Google’s Site Reliability Engineering team spotted the issue within two minutes, identified the root cause within ten, and began recovery within forty. Larger regions stayed down longer, however, as overloaded systems struggled to restart. Among those hit was Cloudflare, whose services wobbled in turn.
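When many tasks restart at once and overload a shared dependency, one common mitigation is exponential backoff with random jitter, which spreads retries out over time. The sketch below is a generic illustration of that technique, not a description of what Google runs; `backoff_delays` and its parameters are hypothetical.

```python
import random


def backoff_delays(attempts, base=0.5, cap=30.0, rng=None):
    """Return one randomized delay (in seconds) per retry attempt."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # The ceiling doubles with each attempt, up to a fixed cap.
        ceiling = min(cap, base * (2 ** attempt))
        # "Full jitter": pick a delay uniformly between 0 and the ceiling,
        # so restarting tasks do not all retry at the same moment.
        delays.append(rng.uniform(0, ceiling))
    return delays
```

A restarting task would sleep for each delay in turn before retrying, so a fleet of tasks spreads its load instead of hitting the backend in lockstep.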

In its incident report, Google pledged, “We will improve our external communications so our customers get the information they need asap,” and promised to keep its monitoring up even during outages. Once again, Google promises to learn from its mistakes, while admitting it still “can’t avoid big outages.”
