Apply machine learning to operational data to predict and prevent system failures.
Automate DevOps practices including scaling, optimization, and controlled restarts.
Develop anomaly detection and self-healing frameworks.
Collaborate with SRE, development, and infrastructure teams to improve reliability.
Build AI-driven dashboards and reports to provide actionable insights.
Machine Learning & AI Frameworks (TensorFlow, PyTorch, scikit-learn)
Monitoring & Observability Tools (Splunk, Prometheus, Grafana, ELK Stack)
Automation & Scripting (Python, Bash, Ansible)
DevOps / SRE background
Product Owner experience
Containerization (Docker, Kubernetes)
8+ years of experience in machine learning
5+ years working with AI frameworks and observability tools
Strong cloud infrastructure knowledge
Proven experience with self-healing systems and proactive monitoring
Jobseeker
Recruiter