Required Skills: Java, Production Support SRE, Python scripting for automation, Splunk, Grafana, Prometheus, application performance monitoring, infrastructure observability concepts
Job Description
Location: Phoenix, AZ
Role: Developer (Site Reliability Engineering)
Experience: 6 – 8 Years
TCS
Position Overview:
We are seeking a Java Site Reliability Engineer (SRE) with strong expertise in Java application development and hands-on experience in production support. The ideal candidate can debug and fix issues at the code level, automate reliability tasks, and keep systems performant and available through proactive monitoring and alerting.
Key Responsibilities:
- Provide Production Support (SRE) for Java-based applications, ensuring high availability, reliability, and performance.
- Perform code-level analysis and minor bug fixes, including code deployment activities.
- Automate operational tasks using Python or other scripting languages.
- Design and implement monitoring solutions by creating dashboards and alerts for production application health.
- Set up log-based alerts (e.g., using Splunk) aligned with defined SLAs.
- Develop application-level process health checks and integrate with alerting systems.
- Build infrastructure and performance monitoring dashboards, covering:
  - Database query performance
  - API call performance (throughput, latency, process rate)
  - System resource utilization and trends
Required Skills:
- Strong experience in Java and Production Support SRE roles.
- Proficiency in Python scripting for automation.
- Hands-on experience with monitoring and alerting tools (e.g., Splunk, Grafana, Prometheus, or similar).
- Strong understanding of application performance monitoring (APM) and infrastructure observability concepts.
- Excellent problem-solving skills and ability to diagnose and resolve complex production issues.