Site Reliability Engineer with Grafana
  • Code Beacons Inc.
11 Hours Ago
Remote
3-10 Years
Required Skills: SRE, DevOps, Systems Engineering, Linux & Shell Scripting, AWS, Azure, Kubernetes, ECS & Docker, Python, Java, CI/CD, OpenTelemetry, PostgreSQL, Ansible, Chef, Puppet
Job Description
We are seeking a highly skilled Site Reliability Engineer (SRE) to ensure the reliability, scalability, security, and performance of our production systems. This role bridges software development and operations, driving automation, monitoring, and engineering excellence. You will report directly to the Senior Director of Engineering.
What You’ll Do:
Reliability & Performance
Ensure high availability, scalability, and reliability of production systems
Define & manage SLIs, SLOs, and SLAs
Conduct capacity planning and performance optimization
Automation & Tooling
Automate infrastructure using Terraform, Terragrunt, Ansible
Build CI/CD pipelines for rapid, reliable deployments
Reduce manual operations through automation
Monitoring & Incident Response
Design & maintain monitoring, logging, and alerting (Datadog)
Participate in on-call rotations; lead incident response
Perform RCA and write postmortems to prevent recurrences
Systems Engineering
Manage cloud infrastructure (AWS, Azure)
Work with Kubernetes, ECS, Docker
Implement best practices for security, networking, and system resilience
Collaboration & Leadership
Partner with engineering teams to design reliable distributed systems
Advocate SRE best practices across the organization
Mentor engineers on tooling, automation, and reliability
What You’ll Need:
Bachelor’s in CS, Engineering, or equivalent experience
3–7 years in SRE, DevOps, or Systems Engineering
Strong Linux & shell scripting skills
Cloud experience: AWS, Azure
Kubernetes/ECS & Docker expertise
Proficiency in Python or Java
Experience with CI/CD and DevOps tooling
Strong grasp of distributed systems, networking & security fundamentals
Preferred Qualifications:
Observability tools (OpenTelemetry)
PostgreSQL experience
Configuration management: Ansible, Chef, Puppet
Experience with zero-downtime deployments or chaos engineering
Soft Skills:
Strong analytical & problem-solving abilities
Excellent communication and collaboration
Thrives in fast-paced environments
Passion for continuous improvement
