PayUHub high error rate and latency due to AWS incident in AZ EU-CENTRAL-1c

Incident Report for PayU Hub

Postmortem

Incident Time (UTC):

07:39 – 08:50 – High error rates and latency in all APIs. Latency peaks of up to 1 min at the 0.5 quantile (median) and 1 min at the 0.9 quantile (90th percentile). Success rate drops to 77% for 2 min.
09:50 – 10:33 – Peaks of latency in Payments API in the 0.9 percentile of up to 1 min. Success rate 100%
10:33 - 10:55 - Success rate drops to 98% in Payments API and latency peaks in Tokenization API. Success rate drop is related to specific integrations (PayUPoland, SafeCharge, Dalenys, PayUCitrus, WorldPay, RSB, Sberbank, Wirecard) and/or API version 1.3.
10:55 - 12:40 - Success rate remains at 98% in Payments API and latency peaks in Tokenization API. Most integrations recover except for SafeCharge, Dalenys, PayUCitrus and/or API version 1.3.

Incident Details (step by step):

07:39 – 08:28:

The Zooz on-call team receives alerts and detects service issues originating from some DC workers, along with Cassandra nodes that are down.
The eu-central-1c AZ is identified as the zone in which all problematic services and nodes are located.
AWS officially acknowledges the outage.
The team considers stopping the problematic nodes, but AWS support clarifies that stopping instances is not currently feasible because of the AZ outage.
Latency issues appear in the system due to timeouts to Cassandra, as the nodes are flapping and cannot be stopped (also due to the AWS outage).
All Cassandra clusters recover on their own.

08:29 - 09:50:

System returns to normal latency and 100% success rate.

09:50 – 10:33:

One Cassandra node in one of the clusters is flapping.
Latency issues start again in the system.
The Zooz team stops the node, as the system should be able to operate properly with one node down.
Latency continues due to a bug in our API gateway, which keeps trying to reach the stopped node.
The latency is partial and does not affect the majority of API calls. Success rate remains 100%.
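
As a rough illustration of why one stopped node should be tolerable: the sketch below assumes a replication factor of 3 and quorum reads and writes. Neither value is stated in this report; they are used here only as an example of the standard majority-quorum arithmetic.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a majority-quorum read or write."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be down while a quorum is still reachable."""
    return replication_factor - quorum(replication_factor)


if __name__ == "__main__":
    rf = 3  # assumed replication factor, not taken from the report
    print(f"RF={rf}: quorum={quorum(rf)}, tolerated failures={tolerated_failures(rf)}")
```

Under these assumptions a single node can be down without failing requests, provided clients stop routing to it.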

10:33 - 10:55:

To decrease latency, the Zooz team starts the problematic node again. Latency returns to normal.
Notification received from AWS about issues with EBS volumes in the same AZ - “Some EBS volumes are experiencing degraded performance in a single Availability Zone in the EU-CENTRAL-1 Region. We are working to resolve this issue. Network connectivity errors and elevated API error rates in the Availability Zone have been resolved”
After the Cassandra node rejoins, the Zooz team begins to see permission errors from various services due to corruption of the sys_auth keyspace.
Success rate drops to 98%.
Specific services in the system experience full/partial outage:
- PayUPoland integration - partial
- SafeCharge - full
- Payments API version 1.3 - partial
- Dalenys - partial
- PayUCitrus - full
- WorldPay - partial
- RSB - partial
- Sberbank - partial
- Wirecard - partial

10:55 – 12:40:

To handle the corruption as soon as possible, the problematic node is removed and the team starts provisioning a replacement node.
Latency spikes appear in the system again in the 0.9 percentile.
Some services remain in partial/full outage:
- SafeCharge - full
- Payments API version 1.3 - partial
- Dalenys - partial
- PayUCitrus - full
Cassandra node preparation takes more time than expected due to the complexity of the Terraform setup and failures in the Ansible playbooks.

12:40:

New Cassandra node replaces the old damaged one.
A repair of the sys_auth keyspace is done.
System returns to normal.

14:23 - AWS sends a resolution notification stating that all issues in the problematic AZ are resolved.

Damage Assessment:

Partial outage and drop in success rate of the system.
High latency in the 0.9 percentile.
The system returned 4000 5XX errors over the whole incident period.

Incident investigation and findings:

Our strategy of spreading across AZs proved itself in our new DC. We need to speed up the migration.
Our Cassandra automation is not good enough and involves manual intervention; it takes far too long to provision a new node.
Our API gateway does not handle a Cassandra node going down as well as other services in the system do.
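
To illustrate the kind of behaviour the gateway finding refers to, here is a minimal sketch of a client that stops routing to a node after a connection failure and only probes it again after a cool-down. It is not the API gateway's actual code; `NodePool`, `retry_after` and the node names are hypothetical and exist only for this example.

```python
import time
from typing import Callable, Dict, List, Optional


class NodePool:
    """Hypothetical client-side view of a set of database nodes."""

    def __init__(self, nodes: List[str], retry_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._nodes = list(nodes)
        self._retry_after = retry_after          # seconds before re-probing a down node
        self._clock = clock
        self._down_since: Dict[str, float] = {}  # node -> time it was marked down

    def mark_down(self, node: str) -> None:
        self._down_since[node] = self._clock()

    def mark_up(self, node: str) -> None:
        self._down_since.pop(node, None)

    def candidates(self) -> List[str]:
        """Nodes that are healthy, or whose cool-down has expired."""
        now = self._clock()
        return [
            node for node in self._nodes
            if node not in self._down_since
            or now - self._down_since[node] >= self._retry_after
        ]

    def execute(self, request: Callable[[str], object]) -> object:
        """Try the request on each candidate node, skipping nodes marked down."""
        last_error: Optional[Exception] = None
        for node in self.candidates():
            try:
                result = request(node)
            except ConnectionError as exc:
                self.mark_down(node)  # do not retry this node until the cool-down ends
                last_error = exc
                continue
            self.mark_up(node)
            return result
        raise RuntimeError("no reachable node") from last_error
```

With this approach only the first request after a node stops pays the connection-failure cost; later requests skip the node until the cool-down expires.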

Action Plan:

Continue with accelerated migration of the system to our new DC with the strategy of spreading instances across AZs - in process.
Improve Cassandra node replacement - reduce time to 5 min - in process.
Upgrade our API gateway - the newly released version includes a fix to the node refresh mechanism - in process.
Add chaos tests for full AZ failure, so that we are able to turn off an AZ if a similar AWS incident happens again - planned for Jan.

Posted Nov 17, 2019 - 10:48 UTC

Resolved

Due to failures in network connectivity and degraded EBS volume performance in a single Availability Zone in the EU-CENTRAL-1 Region, our system experienced a partial outage and high latency.

As AWS reported in their status page:
“Between November 11 11:38 PM and November 12 12:48 AM PST we experienced API errors, network connectivity errors, and degraded EBS volume performance in a single Availability Zone in the EU-CENTRAL-1 Region. The network connectivity errors were resolved at 12:48 AM PST, the API errors were resolved at 1:26 AM PST. We have recovered the majority of the degraded volumes. A small number of volumes remain degraded. We continue to work to recover all affected volumes and will notify customers.”
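
The AWS statement uses PST while the timeline in this report uses UTC. As a small worked conversion, assuming PST is a fixed UTC-8 offset and the year is 2019 (taken from the posting date), the AWS times map to UTC as follows:

```python
from datetime import datetime, timedelta, timezone

# PST as a fixed UTC-8 offset (no daylight saving in November).
PST = timezone(timedelta(hours=-8), "PST")


def pst_to_utc(year: int, month: int, day: int, hour: int, minute: int) -> datetime:
    """Interpret the given wall-clock time as PST and return it in UTC."""
    local = datetime(year, month, day, hour, minute, tzinfo=PST)
    return local.astimezone(timezone.utc)


if __name__ == "__main__":
    events = {
        "AWS incident start (Nov 11 11:38 PM PST)": pst_to_utc(2019, 11, 11, 23, 38),
        "Network errors resolved (12:48 AM PST)": pst_to_utc(2019, 11, 12, 0, 48),
        "API errors resolved (1:26 AM PST)": pst_to_utc(2019, 11, 12, 1, 26),
    }
    for label, when in events.items():
        print(f"{label}: {when:%b %d %H:%M} UTC")
```

The converted start and network-recovery times (07:38 and 08:48 UTC on November 12) line up closely with the 07:39 – 08:50 window reported above.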

Our platforms are designed to be fully redundant against such failures. We are currently migrating to our new DCs, which are spread evenly across AZs. Faults were discovered in one of our services as well as in our ability to quickly dispose of Cassandra nodes in the problematic AZ. We are already fixing these faults and will also conduct regular chaos tests for such failures.

Posted Nov 12, 2019 - 12:40 UTC