On-prem Platform Engineer
Micasa Global
3 Days Ago
C2C
Charlotte, NC
9-12 Years
Required Skills: LLMOps, GenAI Pipelines, On-prem and GCP integration, Azure integration, Inferentia, Alternative accelerators, Service mesh, Networking in GPU clusters
Job Description
LLM Inference & Optimization

vLLM, TensorRT-LLM, Triton Inference Server, SGLang
Inference optimization techniques:
Continuous batching
Speculative decoding
KV cache / Prefix caching
Model optimization:
FP8, AWQ, GPTQ

Distributed & GPU Systems
Tensor parallelism and large model scaling
CUDA, NCCL, GPU architecture
GPU partitioning & optimization (MIG)
Kubernetes & ML Serving
Kubernetes-based ML serving platforms
KServe, OpenShift AI
Helm charts, Operators, platform automation
GPU Orchestration
Run:ai or similar GPU scheduling/orchestration platforms
Multi-tenant GPU workload management
Platform Engineering
Experience building internal AI/ML platforms (on-prem or hybrid)
Strong automation and system design mindset
Observability & Performance
Prometheus, Grafana
ML observability (model latency, throughput, drift, resource utilization)
Performance benchmarking and tuning

Good to Have / Preferred Skills
Experience with LLMOps / GenAI pipelines
Exposure to hybrid cloud (on-prem + GCP/Azure integration)
Familiarity with Inferentia / alternative accelerators
Knowledge of service mesh / networking in GPU clusters
Responsibilities
· Build, configure, and operate on-prem Kubernetes/OpenShift AI platforms for deploying and serving GenAI models and LLM inference workloads.
· Design and optimize high-performance inference stacks using vLLM, TensorRT-LLM, Triton Inference Server, and SGLang, applying techniques such as continuous batching, speculative decoding, and KV caching.
· Manage GPU orchestration and capacity using Run:ai, MIG, CUDA/NCCL, and tensor parallelism to maximize utilization and throughput.
· Deploy and operate Kubernetes ML serving frameworks (KServe, Helm, Operators) for scalable, reliable model serving.
· Drive inference optimization and benchmarking, leveraging FP8, AWQ, and GPTQ quantization and performance tools such as GuideLLM and Locust.
· Implement observability and ML monitoring using Prometheus, Grafana, and Arize AI to ensure SLA/SLO compliance for GenAI services.
· Collaborate with ML and research teams to onboard new models, tune inference performance, and productionize GenAI use cases.
