PayUHub full outage July 21
Incident Report for PayU Hub
Postmortem

Incident Report - PayUHub outage - July 21 2020

This report describes the outage experienced in the PayUHub on the 21st of July 2020.

The incident was caused by the sequential failure of three of our database nodes, each in a different availability zone.

Our databases are resilient, with no single point of failure. They can sustain the loss of one node without any issues, or of more than one node if the failed nodes are in the same AWS AZ. With three nodes down, we had to repair them carefully one by one and rejoin each to the cluster, which is why the incident lasted 2.5 hours.

We have already taken measures to ensure this issue and similar ones do not happen again.

Incident Time (UTC):

04:24 - 07:08 - Three of our primary database nodes failed gradually and sequentially, one by one, due to a combination of a database background process and external data that entered the system through the API. These three nodes are in three different AWS AZs. This caused drops in success rate and then full downtime of the system. Recovery of the nodes by the on-call team was slow due to the sensitivity and nature of the issue.

Incident Details (step by step):

  • 04:21 - 04:32 - Latency is spotted in the external API. The Zooz on-call team starts investigating.
  • 04:32 - First node is flapping. Error rate increases. The Zooz on-call team starts the automated node replacement procedure for the flapping database node.
  • 04:35 - 05:01 - Second node starts flapping. Success rate drops to 60%. After rebooting the node without success, the Zooz on-call team decides to use a shorter but riskier emergency replacement procedure, since the regular procedure takes a few hours.
    This takes more time than expected.
  • 05:28 - 05:39 - Third node starts flapping. Success rate drops to 20%. Because the success rate is very low, the system is loaded with retries, and there is a risk of partial data inconsistency, the Zooz on-call team shuts down the API gateways, blocking external API calls.
  • 06:30 - 06:34 - Third node is recovered quickly using a different, even quicker process, tried for the first time.
  • 06:40 - 07:03 - The changes to the second node are reverted so that the quickest procedure can be applied. It succeeds, and the second node is back up.
  • 07:05 - 07:08 - While the replacement of the first node continues in the background, the API gateways are brought back up and the success rate returns to 100%.
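
The success-rate drops in the timeline above (to 60%, then to 20%) are the kind of signal a sliding-window alert can surface early. The following is a minimal sketch only, not the monitoring PayUHub actually runs: the class name, the window size, and both thresholds are hypothetical values chosen for illustration.

```python
from collections import deque


class SuccessRateMonitor:
    """Tracks the success rate over the most recent `window` API responses.

    All defaults are hypothetical; real thresholds would be tuned to traffic.
    """

    def __init__(self, window=100, warn_below=0.95, critical_below=0.60):
        # deque with maxlen drops the oldest result automatically
        self.results = deque(maxlen=window)
        self.warn_below = warn_below
        self.critical_below = critical_below

    def record(self, status_code):
        # Any 5XX response counts as a failure
        self.results.append(status_code < 500)

    def success_rate(self):
        # With no data yet, report a healthy rate rather than alerting
        if not self.results:
            return 1.0
        return sum(self.results) / len(self.results)

    def level(self):
        rate = self.success_rate()
        if rate < self.critical_below:
            return "critical"
        if rate < self.warn_below:
            return "warning"
        return "ok"
```

A caller would invoke `record()` for each API response and page the on-call team when `level()` changes from "ok". A window counted in requests rather than seconds keeps the sketch simple, at the cost of reacting more slowly during low-traffic periods.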

Damage Assessment:

  • 2.5 hours of full outage.
  • 454,561 5XX API responses (including retries - all endpoints).
  • 52,073 5XX responses to API POST requests (including retries - all endpoints).

Incident investigation and findings:

  • We were missing alerts that could have detected the issue before it affected the API.
  • A specific part of the system lacked input validation to protect against extremely large datasets.
  • Replacing an unhealthy DB node took too long.
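
The second finding concerns input that was too large for a part of the system to handle safely. This report does not say what shape that data took, so the sketch below is purely illustrative: it assumes a JSON request body, and the limits `MAX_BODY_BYTES` and `MAX_COLLECTION_ITEMS`, the function names, and the exception type are all hypothetical.

```python
import json

# Hypothetical limits; real values depend on what downstream storage tolerates
MAX_BODY_BYTES = 1_000_000
MAX_COLLECTION_ITEMS = 1_000


class PayloadTooLarge(ValueError):
    """Raised when a request body exceeds a configured size limit."""


def validate_payload(raw_body: bytes):
    """Reject oversized bodies before parsing, then check nested collections."""
    if len(raw_body) > MAX_BODY_BYTES:
        raise PayloadTooLarge(f"body is {len(raw_body)} bytes")
    payload = json.loads(raw_body)
    _check_collections(payload)
    return payload


def _check_collections(value, path="$"):
    # Walk the parsed JSON and bound the size of every object and array
    if isinstance(value, dict):
        if len(value) > MAX_COLLECTION_ITEMS:
            raise PayloadTooLarge(f"{path} has {len(value)} keys")
        for key, item in value.items():
            _check_collections(item, f"{path}.{key}")
    elif isinstance(value, list):
        if len(value) > MAX_COLLECTION_ITEMS:
            raise PayloadTooLarge(f"{path} has {len(value)} items")
        for index, item in enumerate(value):
            _check_collections(item, f"{path}[{index}]")
```

Checking the raw byte length first means an extremely large body is rejected without being parsed at all. The per-collection check then catches bodies that are small in bytes but still carry more items than the database should receive in one write.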

Action Plan:

  • Add an additional cross-system data validation layer to protect against data-related vulnerabilities and prevent large datasets from entering the system - Done.
  • Add the missing alerts in the system - Done.
  • Automate a fast node recovery procedure to replace damaged nodes more quickly - In Progress.
Posted Jul 27, 2020 - 18:30 UTC

Resolved
This incident has been resolved.
Posted Jul 21, 2020 - 08:15 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
The live environment is back to fully functional mode.
The test environment is currently not accepting connections.
Posted Jul 21, 2020 - 07:10 UTC
Update
We are continuing to work on a fix for this issue.
Posted Jul 21, 2020 - 06:48 UTC
Update
We are continuing to work on a fix for this issue.
Posted Jul 21, 2020 - 05:33 UTC
Update
We are continuing to work on a fix for this issue.
Posted Jul 21, 2020 - 04:58 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Jul 21, 2020 - 04:41 UTC
Investigating
We are currently investigating this issue.
Posted Jul 21, 2020 - 04:41 UTC
This incident affected: Payments Processing (Tokens API, Customers API, Payments API) and Configuration APIs.