Required Skills: Production Support, AWS, Dynatrace, ServiceNow, Splunk
Job Description
Role: Production Support
Location: Newark, NJ (Hybrid)
Duration: 12 months
Notes: The manager has answered some of the questions asked about this request:
- Is this position comparable to an app support lead role?
No – this is a production support role
- Which technologies or tech stack are essential for this position?
This is not an active development role; however, foundational knowledge of how tech infrastructure and utilities work is essential. Examples include working with Dynatrace, building a Splunk dashboard, understanding which metrics are essential for alerting on 4XX vs. 5XX errors, understanding how asynchronous infrastructure such as Kafka and Kinesis works, troubleshooting logs for services and Lambda functions, and navigating the AWS Dashboard for troubleshooting.
- What is the percentage split between hands-on work and leadership responsibilities in this role?
80/20
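As an illustration of the 4XX vs. 5XX alerting point in the tech stack answer above, here is a minimal Python sketch of how the two error classes might be alerted on separately. The thresholds and function names are hypothetical and chosen only for illustration; in practice this logic would more likely live in a Splunk or Dynatrace alert definition.

```python
from collections import Counter

# Hypothetical thresholds, chosen only for illustration.
SERVER_ERROR_RATE_THRESHOLD = 0.01  # alert if more than 1% of requests return 5XX
CLIENT_ERROR_RATE_THRESHOLD = 0.20  # alert if more than 20% of requests return 4XX


def status_class(code: int) -> str:
    """Map an HTTP status code to its class, e.g. 404 -> '4XX'."""
    return f"{code // 100}XX"


def evaluate_alerts(status_codes: list[int]) -> list[str]:
    """Return the error classes whose rate exceeds its threshold."""
    total = len(status_codes)
    if total == 0:
        return []
    counts = Counter(status_class(code) for code in status_codes)
    alerts = []
    # 5XX errors indicate a server-side fault, so a low threshold is used.
    if counts["5XX"] / total > SERVER_ERROR_RATE_THRESHOLD:
        alerts.append("5XX")
    # 4XX errors are often caused by the caller, so more noise is tolerated.
    if counts["4XX"] / total > CLIENT_ERROR_RATE_THRESHOLD:
        alerts.append("4XX")
    return alerts
```

The separate thresholds reflect a common assumption: a small share of 5XX responses usually warrants attention, while 4XX responses only become a concern when their rate rises sharply.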
Key Responsibilities:
Incident Management and Resolution:
- Oversee the triage, investigation and resolution of production issues, ensuring timely communication and status updates
- Manage incident response efforts, including documentation, root cause analysis, and post-incident reviews to identify preventative actions
- Establish clear escalation protocols and ensure adherence to service level agreements (SLAs)
- Coordinate resolution and follow-ups with dependencies outside the immediate team
- Coordinate knowledge transfer (KT) sessions between development teams and L1/L2 triage to establish runbooks and a knowledge base
Team Leadership and Coordination:
- Coordinate with development, QA, and infrastructure teams to ensure seamless issue resolution and knowledge sharing
- Foster a strong ownership mindset within the team, ensuring accountability for system health and stability
Monitoring and Alerting:
- Define and maintain effective monitoring solutions in partnership with development teams to proactively identify and address potential issues
- Continuously improve observability by implementing dashboards, alerts and automated health checks in partnership with development teams
Process and Documentation:
- Develop and maintain detailed runbooks, SOPs and knowledge base articles to ensure consistent response procedures
- Establish best practices for incident response, including communication templates and decision frameworks
Stakeholder Communication:
- Serve as the primary point of contact for production issues affecting client experiences
- Provide clear, concise updates to leadership, internal teams and clients during incidents and post-incident reviews
Continuous Improvement:
- Identify patterns in recurring incidents and partner with development teams to implement permanent fixes
- Drive initiatives to enhance system reliability, scalability, and performance
Qualifications and Skills:
- Proven experience in a production support leadership role for client-facing applications
- Strong understanding of incident management frameworks
- Proficiency in troubleshooting application, database, and infrastructure issues
- Familiarity with monitoring tools such as Dynatrace, Datadog, and Splunk
- Familiarity with incident management platforms such as ServiceNow
- Ability to prioritize tasks effectively and communicate technical concepts to non-technical stakeholders
- Excellent problem-solving skills and a calm, solution-focused approach under pressure
- Experience working in AWS
- Familiarity with CI/CD pipelines and release management processes
Preferred:
- Background in software development or scripting for automation
- Previous experience in the financial services industry
Success Metrics:
- MTTA: Mean time to acknowledge
- MTTR: Mean time to resolve
- Stakeholder satisfaction with incident communication
- Knowledge base usage rate and coverage
- Number of issues handed over to L1/L2, EMKT teams
- Number of system-identified vs. user-reported alerts, and the trend over time
- Enhancements and alerts requested
- Minimize the number of user-reported incidents
- Number of incidents resolved by L1/L2 without the app support team
- Reduction in resolution times due to documented processes
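For illustration, a few of the success metrics above (MTTA, MTTR, and the share of system-identified incidents) could be computed from incident records roughly as follows. This is a minimal Python sketch: the `Incident` fields are hypothetical, and it assumes both MTTA and MTTR are measured from the time the incident was opened. Real data would more likely come from a ServiceNow export.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record shape; real field names depend on the ticketing tool.
    opened: datetime
    acknowledged: datetime
    resolved: datetime
    system_identified: bool  # True if raised by monitoring, False if user reported


def mtta(incidents: list[Incident]) -> timedelta:
    """Mean time to acknowledge: average of (acknowledged - opened)."""
    seconds = mean((i.acknowledged - i.opened).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to resolve: average of (resolved - opened)."""
    seconds = mean((i.resolved - i.opened).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


def system_identified_share(incidents: list[Incident]) -> float:
    """Fraction of incidents detected by monitoring rather than reported by users."""
    return sum(i.system_identified for i in incidents) / len(incidents)


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 9, 0)
    sample = [
        Incident(start, start + timedelta(minutes=5), start + timedelta(minutes=60), True),
        Incident(start, start + timedelta(minutes=15), start + timedelta(minutes=120), False),
    ]
    print("MTTA:", mtta(sample))
    print("MTTR:", mttr(sample))
    print("System-identified share:", system_identified_share(sample))
```

All three functions assume a non-empty list of incidents; an empty list raises an error rather than returning a misleading zero.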