Closed Bug 1335969 Opened 9 years ago Closed 9 years ago

Rebuild new data pipeline ingestion infrastructure due to kafka failure

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: whd, Assigned: whd)

References

Details

(Whiteboard: [SvcOps])

On 01/31/2017 a Kafka node in the production environment failed, and a new node was automatically spun up to replace it. This happens every few months due to the nature of the cloud; Kafka's automatic fail-over worked as expected and there was no disruption in data processing.

Today I went to reassign the partitions for various topics to include the replacement node, starting with telemetry.error (our smallest topic). The reassignment succeeded in the sense that Kafka considered it complete, but it finished suspiciously quickly and the topic remained under-replicated. I did another partition reassignment of the same topic to see if shuffling it around would help, but it appears to have put Kafka into an inconsistent state, where some leaders are rejecting replication requests from followers because (presumably) those followers are using a different partition assignment. The closest I've found to someone else experiencing this issue is https://www.mail-archive.com/users@kafka.apache.org/msg23838.html, but there was no resolution there. Kafka doesn't really support cancelling partition reassignments (https://issues.apache.org/jira/browse/KAFKA-1676).

Theoretically I could abort the reassignment by deleting things from ZooKeeper manually, but rather than attempt any more janky recovery procedures, I am going to rebuild a parallel stack. The only piece that needs access to both stacks in parallel is the CEP, which I'll temporarily configure to read from both Kafka clusters. Since the edge is behind the tee, DNS propagation is not an issue, and once I repoint edge DNS all data will be diverted to the parallel stack.

I'm marking this as a blocker for bug #1331880 because I don't want to mess with that until prod Kafka is 100% stable.
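For reference, the reassignment and verification steps look roughly like the sketch below. The topic/partition/broker IDs and the zk:2181 connect string are placeholders (the real cluster layout isn't in this bug), and the manual ZooKeeper delete at the end is the unsupported abort workaround discussed in KAFKA-1676, not something I actually ran here.

```shell
# Write a reassignment plan for the under-replicated topic.
# Broker IDs [1, 2, 4] are hypothetical: brokers 1 and 2 surviving,
# broker 4 being the auto-provisioned replacement node.
cat > reassign-telemetry-error.json <<'EOF'
{
  "version": 1,
  "partitions": [
    {"topic": "telemetry.error", "partition": 0, "replicas": [1, 2, 4]}
  ]
}
EOF

# Execute the reassignment, then verify it (commands shown, not run here,
# since they need a live cluster):
#   kafka-reassign-partitions.sh --zookeeper zk:2181 \
#     --reassignment-json-file reassign-telemetry-error.json --execute
#   kafka-reassign-partitions.sh --zookeeper zk:2181 \
#     --reassignment-json-file reassign-telemetry-error.json --verify
#
# Under-replicated partitions can be listed with:
#   kafka-topics.sh --zookeeper zk:2181 --describe --under-replicated-partitions
#
# The janky abort path (KAFKA-1676 workaround): delete the in-progress
# reassignment znode by hand, then bounce the controller.
#   zookeeper-shell.sh zk:2181 delete /admin/reassign_partitions

# Sanity-check that the plan file is well-formed JSON.
python3 -c "import json; json.load(open('reassign-telemetry-error.json'))" \
  && echo "reassignment plan is valid JSON"
```

The `--verify` pass is what exposes the failure mode described above: it can report the reassignment as completed even though `--describe` still shows the topic under-replicated.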
This was completed without incident.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard