Closed
Bug 1335969
Opened 9 years ago
Closed 9 years ago
Rebuild new data pipeline ingestion infrastructure due to kafka failure
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: whd, Assigned: whd)
References
Details
(Whiteboard: [SvcOps])
On 01/31/2017 a Kafka node in the production environment failed, and a replacement node was automatically spun up. This happens every few months due to the nature of the cloud. Kafka's automatic failover worked as expected, and there was no disruption in data processing.
Today I went to reassign the partitions for various topics to include the replacement node, starting with telemetry.error (our smallest topic). The reassignment succeeded in the sense that Kafka considered it complete, but it finished suspiciously quickly and the topic remained under-replicated. I ran a second partition reassignment of the same topic to see if shuffling it around would help, but that appears to have put Kafka into an inconsistent state, where some leaders are refusing replication requests from bad followers, presumably because those followers are using a different partition assignment. The closest I've found to someone else experiencing this issue is
https://www.mail-archive.com/users@kafka.apache.org/msg23838.html
but there was no resolution.
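For reference, the kind of reassignment plan and under-replication check described above can be sketched as follows. This is a minimal illustration, not the procedure actually used here: the broker IDs, partition count, and replication factor are hypothetical, and the round-robin placement is just one simple way to spread replicas. The JSON layout is the one accepted by kafka-reassign-partitions.sh via --reassignment-json-file.

```python
import json


def build_reassignment(topic, num_partitions, broker_ids, replication_factor):
    """Build a reassignment plan that spreads replicas round-robin over broker_ids.

    The returned dict follows the JSON layout read by kafka-reassign-partitions.sh.
    All inputs are illustrative; real values come from the cluster itself.
    """
    if replication_factor > len(broker_ids):
        raise ValueError("replication factor exceeds number of brokers")
    partitions = []
    for p in range(num_partitions):
        # Start each partition on a different broker, then take the next ones in order.
        replicas = [broker_ids[(p + i) % len(broker_ids)]
                    for i in range(replication_factor)]
        partitions.append({"topic": topic, "partition": p, "replicas": replicas})
    return {"version": 1, "partitions": partitions}


def under_replicated(plan, isr_by_partition):
    """Return partition numbers whose in-sync replica set lacks an assigned replica."""
    result = []
    for entry in plan["partitions"]:
        isr = set(isr_by_partition.get(entry["partition"], []))
        if not set(entry["replicas"]) <= isr:
            result.append(entry["partition"])
    return result


if __name__ == "__main__":
    # Hypothetical cluster: brokers 1-3, with 3 standing in for the replacement node.
    plan = build_reassignment("telemetry.error", 4, [1, 2, 3], 2)
    print(json.dumps(plan, indent=2))
```

A topic that "completes" reassignment but stays under-replicated would show up here as a non-empty result from under_replicated when fed the ISR lists reported by the cluster.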
Kafka doesn't really support cancelling partition reassignments (https://issues.apache.org/jira/browse/KAFKA-1676). Theoretically I could abort the reassignment by manually deleting the relevant entries from ZooKeeper, but rather than attempt any more janky recovery procedures, I am going to rebuild a parallel stack.
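As a read-only illustration of what lives in ZooKeeper during a reassignment, the sketch below parses the payload of the in-flight reassignment znode. It assumes a ZooKeeper-based Kafka release, where that state is recorded at /admin/reassign_partitions in the same JSON layout as the reassignment file. Fetching the bytes is left to whatever ZooKeeper client is on hand, and the helper name is hypothetical. Nothing here deletes or modifies anything.

```python
import json

# Znode where ZooKeeper-based Kafka records an in-flight reassignment (assumption
# stated in the text above; verify against the Kafka version actually deployed).
REASSIGN_ZNODE = "/admin/reassign_partitions"


def pending_reassignments(payload):
    """Parse the znode payload into (topic, partition, replicas) tuples.

    payload is the raw bytes read from REASSIGN_ZNODE; an empty or missing
    payload means no reassignment is recorded as in progress.
    """
    if not payload:
        return []
    doc = json.loads(payload.decode("utf-8"))
    return [(e["topic"], e["partition"], list(e["replicas"]))
            for e in doc.get("partitions", [])]
```

Inspecting this list first shows exactly which partitions Kafka still believes are moving, which is the information needed before deciding between manual cleanup and a rebuild.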
The only piece that needs access to both stacks in parallel is the CEP, which I'll temporarily configure to read from both Kafka clusters. Since the edge is behind the tee, DNS propagation is not an issue: once I repoint edge DNS, all data will be diverted to the parallel stack.
I'm marking this as a blocker for bug #1331880 because I don't want to mess with that until prod Kafka is 100% stable.
Comment 1 (Assignee) • 9 years ago
This was completed without incident.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated • 7 years ago
Product: Cloud Services → Cloud Services Graveyard