Senior MLOps Engineer

The X4 Group

Key Responsibilities

- Design and implement ML training, evaluation, and deployment pipelines using Azure ML, Azure AI Foundry, and Prompt Flow for LLM applications.
- Operate and manage SUSE Linux Enterprise (SLES) GPU clusters with NVIDIA H100 hardware: driver installation, CUDA/NCCL tuning, and multi-GPU and distributed training support.
- Integrate MLflow tracking, the Azure ML model registry, and the Model Catalog for unified model versioning and promotion.
- Deploy GPU-based inference endpoints (Managed Online Endpoints, AKS GPU node pools, or Arc-enabled Kubernetes) with traffic splits and rollback strategies.
- Incorporate Responsible AI practices: automated evaluations, bias/fairness checks, and logging and monitoring via Azure's dashboards.
- Monitor model performance, data drift, and GPU metrics (utilization, memory, throughput) via Azure Monitor, Log Analytics, and NVIDIA DCGM Exporter integration.
- Automate CI/CD in Azure DevOps, from data preparation to model deployment, using Infrastructure as Code (Terraform / Bicep) and strong governance gates.
- Collaborate with infrastructure, DevSecOps, and data engineering teams to ensure GPU resources are secure, governed, and cost-optimized.
- Work with Azure's offerings: feature store, compute targets, managed endpoints, pipelines, and observability best practices.

Azure AI / ML Stack Additions & Specifics

- Use Azure Machine Learning workspaces to organize experiments, compute targets, metrics, and security boundaries.
- Leverage the Model Catalog (Azure ML / Foundry) to discover foundation models (OpenAI, Hugging Face, NVIDIA, etc.) and integrate them into workflows.
- Use Prompt Flow to prototype, test, and deploy LLM-based applications.
- Adopt Azure ML managed compute clusters (GPU-enabled), compute instances, and scalable job submission mechanisms.
- Apply Azure's Well-Architected ML guidance: network isolation (private endpoints), workspace security, constrained package use, checkpointing, and multi-region deployment.
- Enable multi-stage model promotion pipelines using Azure ML registries and gated approvals.
- Use Azure AI Foundry for agent orchestration, evaluation, and integrated model operations.

Technical Stack

- Azure ML, Azure AI Foundry, Prompt Flow, Model Catalog / registries
- SUSE Linux Enterprise (SLES), NVIDIA CUDA, NCCL, MIG configurations
- Kubernetes with GPU node support (AKS, Arc-enabled K8s)
- Infrastructure as Code: Terraform, Bicep, Azure CLI
- CI/CD: Azure DevOps Pipelines (build/train/deploy workflows)
- Monitoring / telemetry: Azure Monitor, Log Analytics, Prometheus, DCGM Exporter
- Responsible AI tools and dashboards, evaluation SDKs

Preferred Skills & Experience

- 7+ years in ML/AI engineering or MLOps roles, with significant GPU workload experience.
- Hands-on experience with SUSE Linux (SLES) in production AI environments.
- In-depth knowledge of NVIDIA H100 architecture (HBM3, NVLink, MIG, multi-GPU orchestration).
- Experience with distributed training (DeepSpeed, Horovod, PyTorch DDP).
- Proven track record of deploying ML models to production in hybrid (cloud + on-prem) environments.
- Experience implementing governance, security, and compliance for ML platforms.
