Kafka Tier 3 Support (Platform & Operations)
  • Siri Info Solutions
5 Hours Ago
W2, C2C
Canton-MA
10-30 Years
Required Skills: Kafka, optimization, scalability, reliability, RCA
Job Description
Position: Kafka Tier 3 Support (Platform & Operations)
Location: Canton, MA (Onsite)
Duration: Contract
 
1. Tier 3 Incident Management & Escalation Support
Act as the highest technical escalation point for Kafka production incidents (Sev1 / Sev2).
Lead deep troubleshooting across:
Broker instability, controller elections, ISR shrinkage
Under-replicated partitions and leader imbalance
Producer/consumer failures, lag spikes, and rebalance storms
Disk, network, JVM, and request handler saturation
Provide hands-on remediation for complex issues, including:
Partition reassignment and leader rebalance
Broker configuration tuning
Throttle/quota strategies for noisy producers or consumers
Coordinate with vendor support during service incidents, providing logs, metrics, and forensic details.
Guide Tier 2 teams during major incidents and validate restoration actions.
 
2. Kafka Performance Engineering & Optimization
Analyze Kafka workloads for performance and scalability risks:
Partition skew and hot partitions
Inefficient producer batching/compression
Consumer lag root cause analysis
Thread pool, I/O, and network bottlenecks
Recommend and validate:
Topic design (partition count, replication factor, retention, compaction)
Producer and consumer configuration best practices
Quotas, quota enforcement, and multi-tenant controls
Support onboarding of high-throughput or latency-sensitive workloads, ensuring Kafka is correctly sized and tuned.
 
3. Platform Stability, Reliability & Resilience
Diagnose and resolve systemic Kafka stability issues:
Repeated broker failures or flapping
Metadata/controller instability (ZooKeeper or KRaft)
Recovery issues following failovers or maintenance events
Support resilience initiatives:
Multi-AZ cluster health validation
Replication and DR strategies (MirrorMaker 2, Replicator, or app-level DR patterns)
Failover testing and validation
Define and improve Kafka SLOs for availability, durability, and latency.
 
4. Change, Upgrade & Configuration Leadership
Lead medium- to high-risk Kafka changes, including:
Broker and cluster configuration changes
Partition expansion or large-scale reassignment
Topic policy changes impacting durability or performance
Support and plan:
Kafka version upgrades
MSK / Confluent upgrade cycles
Client compatibility and rollout strategies
Participate in CAB reviews, assess risk, and design rollback and validation plans.
 
5. Root Cause Analysis & Continuous Improvement
Own RCA documentation for major incidents with clear corrective and preventive actions (CAPA).
Identify recurring failure patterns and architectural gaps.
Recommend platform-level improvements:
Automation opportunities
Guardrails and standards
Monitoring and alerting enhancements
Contribute to continuous improvement of runbooks, knowledge base articles, and operational playbooks.
 
6. Mentorship & Collaboration
Provide technical guidance and mentoring to Tier 2 Kafka support teams.
Collaborate with:
Application teams on Kafka client usage and best practices
Platform and SRE teams on capacity planning and reliability engineering
Security teams on access control, encryption, and compliance requirements
Act as a subject matter expert for Kafka within the organization.
