Production ML Engineer (LLMs, Image Gen, Personalization) - Contract to Hire

Note: You must be comfortable working on products that can involve spicy subject matter and mature themes.

About the Role

We’re looking for a production-minded ML Engineer to lead and own core AI/ML systems across LLMs, image generation, and personalization. This is a hands-on engineering role focused on shipping high-impact features quickly and reliably—not research for research’s sake.

What You’ll Work On

  • LLM systems: prompting/orchestration, chat memory, RAG, and personalization
  • Training & fine-tuning: LLMs / Diffusers / TTS with reproducible pipelines (LoRA/QLoRA, PEFT); see the LoRA sketch after this list
  • High-performance inference: real-time serving with vLLM / TGI, ONNX Runtime, TensorRT-LLM, Triton Inference Server, and Hugging Face Accelerate
  • GenAI features: image generation (SDXL/Diffusers), TTS/STT, and occasional video workflows
  • Reliability & insight: evaluation harnesses, observability/telemetry, and latency/throughput tuning
  • Ownership: model/version lifecycle, CI/CD, model registries, and catching performance regressions
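
To make the fine-tuning bullet concrete, here is a minimal sketch of attaching a LoRA adapter with Hugging Face PEFT. The base model name, rank, and target modules are illustrative assumptions, not our prescribed setup.

```python
# Minimal LoRA fine-tuning setup with Hugging Face PEFT.
# Base model, rank, and target modules are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.1-8B"  # hypothetical base checkpoint
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base)

config = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,                        # LoRA scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # confirms only adapter weights train
```

From here the wrapped model trains with a standard Trainer loop; QLoRA follows the same pattern over a 4-bit quantized base.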

What You Bring

  • 4+ years of ML engineering with shipped production workloads
  • Strong LLM experience (fine-tuning, prompt strategies, RAG, evals, safety/guardrails)
  • Image gen experience (Diffusers, SDXL; ControlNet/IP-Adapter is a plus)
  • Proficiency with Python, Docker, and cloud deployment (AWS/GCP/Azure)
  • Inference optimization on GPUs (CUDA fundamentals, quantization, batching, KV-cache tricks); a quantization sketch follows this list
  • Startup mindset: iterative delivery, bias to action, crisp communication
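
Quantization is one of the levers named in the inference bullet above; here is a minimal sketch of a 4-bit (NF4) load via transformers + bitsandbytes, with an illustrative model name:

```python
# Minimal 4-bit (NF4) quantized model load via transformers + bitsandbytes.
# The checkpoint name is illustrative.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 weights
    bnb_4bit_compute_dtype=torch.bfloat16, # matmuls run in bf16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",   # hypothetical checkpoint
    quantization_config=bnb,
    device_map="auto",           # spread layers across available GPUs
)
```

Loading this way roughly quarters weight memory versus fp16 at some quality cost; we expect candidates to reason about that trade-off with numbers.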

Bonus Points

  • TTS/STT (Whisper, VITS/FastPitch, NeMo; ElevenLabs API familiarity)
  • Personalization systems, chat memory stores, multi-modal pipelines
  • Distributed training (DeepSpeed, FSDP, Ray) and model versioning/registries (MLflow)
  • Vector search (pgvector, Milvus, Pinecone, Weaviate) and retrieval quality tuning; a pgvector sketch follows this list
  • Experience with evaluation frameworks (Ragas/DeepEval) and observability (OpenTelemetry, Langfuse, Prometheus/Grafana)
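
As a reference point for the vector-search bullet, a minimal pgvector retrieval sketch; the connection string, table, and embedding dimension are assumptions:

```python
# Minimal nearest-neighbor retrieval with Postgres + pgvector.
# Connection string, table name, and embedding dimension are assumptions.
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

query_embedding = np.random.rand(768).astype(np.float32)  # stand-in for a real embedding

with psycopg.connect("postgresql://localhost/appdb") as conn:
    register_vector(conn)  # adapt numpy arrays to the vector column type
    rows = conn.execute(
        "SELECT id, content FROM documents "
        "ORDER BY embedding <=> %s LIMIT 5",  # <=> is cosine distance
        (query_embedding,),
    ).fetchall()
```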

Responsibilities (Contractor)

LLM Engineering

  • Build, refactor, and productionize LLM inference modules
  • Maintain and evolve API endpoints for AI services (see the sketch after this list)
  • Migrate/deploy models across cloud providers; manage scaling/rollbacks
  • Support training, memory systems, and semantic search integrations
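
For the endpoint work flagged above, a minimal sketch of a service route fronting an OpenAI-compatible vLLM server; the route, internal URL, and served-model name are illustrative:

```python
# Minimal AI-service endpoint: FastAPI fronting an OpenAI-compatible
# vLLM server. Route, internal URL, and model name are illustrative.
import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
VLLM_URL = "http://vllm:8000/v1/completions"  # assumed internal serving host

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 256

@app.post("/generate")
async def generate(req: GenerateRequest) -> dict:
    async with httpx.AsyncClient(timeout=30.0) as client:
        resp = await client.post(VLLM_URL, json={
            "model": "served-model",  # placeholder served-model id
            "prompt": req.prompt,
            "max_tokens": req.max_tokens,
        })
    resp.raise_for_status()
    return resp.json()
```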

AI Systems & Infrastructure

  • Design and implement robust AI pipelines (evals, telemetry, fine-tuning, data curation)
  • Stand up end-to-end observability and evaluation with clear SLOs
  • Own performance: profiling, caching, batching, speculative decoding, paged attention
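
On the performance item, batching and paged attention come largely from the serving engine itself; a minimal vLLM sketch (model name illustrative):

```python
# Minimal batched generation with vLLM, which implements paged attention
# and continuous batching internally. The model name is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", gpu_memory_utilization=0.90)
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = ["Summarize RAG in one sentence.", "Why page the KV cache?"]
outputs = llm.generate(prompts, params)  # requests are scheduled and batched together
for out in outputs:
    print(out.outputs[0].text)
```

Profiling, caching, and speculative decoding then layer on top of a baseline like this.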

Why This Work Matters

Your work makes our AI features reliable, scalable, and measurable by:

  • Enabling multi-cloud deployment for flexibility and cost control
  • Improving output quality via guardrails, observability, and systematic evaluation
  • Powering personalization with solid training pipelines and prompt management
  • Providing business-critical APIs that unify AI/ML functionality across the product

Our Typical Stack

Python • PyTorch • Hugging Face (Transformers, Diffusers, Accelerate, PEFT) • vLLM or TGI • ONNX Runtime • TensorRT-LLM • Triton Inference Server • CUDA/FlashAttention • bitsandbytes/quantization • Ray/Prefect/Airflow • MLflow/Weights & Biases • Postgres + pgvector/Milvus/Pinecone • Redis • Kafka/PubSub • OpenTelemetry • Prometheus/Grafana • Langfuse

Engagement Details

  • Contract (hourly or milestone-based)
  • Remote; 3–4 hours overlap with US Eastern Time preferred
  • Start: immediate

How to Apply (please include)

1. Links to 1–3 shipped projects or repos showing production ML work (not just notebooks)

2. A short note on how you cut inference latency or scaled throughput—be specific (numbers, tools, changes)

3. Your experience with fine-tuning (methods, data prep, evals)

4. Your hourly rate and earliest start date

Skills/Tags

Machine Learning, Deep Learning, Large Language Models (LLMs), Generative AI, PyTorch, Hugging Face, Prompt Engineering, RAG, Computer Vision, Diffusers/SDXL, ONNX Runtime, TensorRT, Triton Inference Server, MLOps, Model Serving, CUDA, Python, Docker, Kubernetes, Observability, MLflow, Langfuse, Vector Databases, Ray, Airflow/Prefect
