
Lessons from the Cloud: Analyzing Google's Recent Outage and Its Implications
About this listen
Hello and welcome to the Cloud Minute. Last week, Google Cloud suffered a three-hour outage that left customers unable to access their rented infrastructure. At the heart of the problem was a Service Control update rolled out on May 29 without a feature flag or proper error handling. When a policy change on June 12 introduced “unintended blank fields,” a “null pointer caused the binary to crash,” triggering a global crash loop.
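To make that failure mode concrete, here is a minimal sketch in Go of how code shipped without a null check or a feature flag can turn one malformed policy record into a process-wide crash loop. The names (PolicyRecord, QuotaFields, applyPolicy*) are hypothetical illustrations, not taken from Google's incident report.

```go
package main

import (
	"fmt"
	"log"
)

// PolicyRecord is a hypothetical stand-in for the policy data the service
// consumes; the real Service Control schema is not public.
type PolicyRecord struct {
	Name  string
	Quota *QuotaFields // may arrive nil if the source row has blank fields
}

type QuotaFields struct {
	Limit int
}

// applyPolicyUnsafe mirrors the described failure: it dereferences Quota
// without checking for nil, so a single blank record panics the binary.
func applyPolicyUnsafe(p PolicyRecord) int {
	return p.Quota.Limit // nil pointer dereference on blank fields
}

// applyPolicySafe adds the two missing guards: a feature flag that lets the
// new code path be switched off without a redeploy, and explicit handling
// of blank (nil) fields instead of a crash.
func applyPolicySafe(p PolicyRecord, newPathEnabled bool) (int, error) {
	if !newPathEnabled {
		return 0, nil // flag off: keep the old behaviour
	}
	if p.Quota == nil {
		return 0, fmt.Errorf("policy %q has blank quota fields; skipping", p.Name)
	}
	return p.Quota.Limit, nil
}

func main() {
	bad := PolicyRecord{Name: "example-policy"} // Quota left nil, like the blank fields

	if limit, err := applyPolicySafe(bad, true); err != nil {
		log.Printf("rejected malformed policy: %v", err) // degrade gracefully
	} else {
		log.Printf("limit = %d", limit)
	}

	// applyPolicyUnsafe(bad) // would panic; restarted repeatedly, that is a crash loop
}
```

The point of the flag in this sketch is that a bad rollout can be disabled with a configuration change rather than an emergency binary rollback, which is exactly the escape hatch the update in question lacked.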
Google’s Site Reliability Engineering team spotted the issue within two minutes, identified the root cause in ten, and began recovery in forty—but larger regions stayed down longer as overloaded systems struggled to restart. Among those hit was Cloudflare, whose services wobbled in turn.
In its incident report, Google pledged, “We will improve our external communications so our customers get the information they need asap,” and committed to keeping its monitoring and communication infrastructure available even during outages. Once again, Google promises to learn from its mistakes, even as it admits it still “can’t avoid big outages.”
Link to Article