TekJobs - Easiest way to find your next right candidate

On-prem Platform Engineer

Micasa Global

56 Days Ago

Offered Rate NA

Tax Type C2C

Work location Charlotte, NC

Experience 9-12 Years

Required Skills: LLMOps, GenAI Pipelines, On-prem and GCP integration, Azure integration, Inferentia, Alternative accelerators, Service mesh, Networking in GPU clusters

Job Description

LLM Inference & Optimization

vLLM, TensorRT-LLM, Triton Inference Server, SGLang
Inference optimization techniques: continuous batching, speculative decoding, KV cache / prefix caching
Model optimization (quantization): FP8, AWQ, GPTQ

Distributed & GPU Systems
Tensor parallelism and large model scaling
CUDA, NCCL, GPU architecture
GPU partitioning & optimization (MIG)

Kubernetes & ML Serving
Kubernetes-based ML serving platforms
KServe, OpenShift AI
Helm charts, Operators, platform automation

GPU Orchestration
Run:ai or similar GPU scheduling/orchestration platforms
Multi-tenant GPU workload management

Platform Engineering
Experience building internal AI/ML platforms (on-prem or hybrid)
Strong automation and system design mindset

Observability & Performance
Prometheus, Grafana
ML observability (model latency, throughput, drift, resource utilization)
Performance benchmarking and tuning

Good to Have / Preferred Skills
Experience with LLMOps / GenAI pipelines
Exposure to hybrid cloud (on-prem + GCP/Azure integration)
Familiarity with Inferentia / alternative accelerators
Knowledge of service mesh / networking in GPU clusters

Responsibilities
Build, configure, and operate on-prem Kubernetes/OpenShift AI platforms for deploying and serving GenAI models and LLM inference workloads.
Design and optimize high-performance inference stacks using vLLM, TensorRT-LLM, Triton Inference Server, and SGLang, applying techniques such as continuous batching, speculative decoding, and KV caching.
Manage GPU orchestration and capacity using Run:ai, MIG, CUDA/NCCL, and tensor parallelism to maximize utilization and throughput.
Deploy and operate Kubernetes ML serving frameworks (KServe, Helm, Operators) for scalable, reliable model serving.
Drive inference optimization and benchmarking, leveraging FP8, AWQ, and GPTQ quantization and performance tools such as GuideLLM and Locust.
Implement observability and ML monitoring using Prometheus, Grafana, and Arize AI to ensure SLA/SLO compliance for GenAI services.
Collaborate with ML and research teams to onboard new models, tune inference performance, and productionize GenAI use cases.
