Production ML Engineer (LLMs, Image Gen, Personalization) - Contract to Hire
Upwork
Note: You must be comfortable working on products that can involve spicy subject matter and mature themes.
About the Role
We’re looking for a production-minded ML Engineer to lead and own core AI/ML systems across LLMs, image generation, and personalization. This is a hands-on engineering role focused on shipping high-impact features quickly and reliably—not research for research’s sake.
What You’ll Work On
- LLM systems: prompting/orchestration, chat memory, RAG, and personalization
- Training & fine-tuning: LLMs / Diffusers / TTS with reproducible pipelines (LoRA/QLoRA, PEFT)
- High-performance inference: real-time serving with vLLM / TGI, ONNX Runtime, TensorRT-LLM, Triton Inference Server, and Hugging Face Accelerate
- GenAI features: image generation (SDXL/Diffusers), TTS/STT, and occasional video workflows
- Reliability & insight: evaluation harnesses, observability/telemetry, and latency/throughput tuning
- Ownership: model/version lifecycle, CI/CD, model registries, and tracking down performance regressions
What You Bring
- 4+ years ML engineering with shipped, production workloads
- Strong LLM experience (fine-tuning, prompt strategies, RAG, evals, safety/guardrails)
- Image generation experience (Diffusers, SDXL; ControlNet/IP-Adapter a plus)
- Proficiency with Python, Docker, and cloud deployment (AWS/GCP/Azure)
- Inference optimization on GPUs (CUDA fundamentals, quantization, batching, KV-cache tricks); see the sketch after this list
- Startup mindset: iterative delivery, bias to action, crisp communication
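To give a rough flavor of the quantization/batching bullet above, here is a minimal, illustrative sketch using Hugging Face Transformers with bitsandbytes 4-bit loading. The model ID, prompts, and generation settings are placeholders, not our production configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder checkpoint

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit NF4 weight quantization
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "left"             # left-pad for causal-LM batching
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

# Batch prompts into one forward pass instead of looping one at a time.
prompts = ["Summarize our refund policy in one line.", "Write a tagline for a GPU."]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64, use_cache=True)  # KV cache on

print(tokenizer.batch_decode(out, skip_special_tokens=True))
```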
Bonus Points
- TTS/STT (Whisper, VITS/FastPitch, NeMo; ElevenLabs API familiarity)
- Personalization systems, chat memory stores, multi-modal pipelines
- Distributed training (DeepSpeed, FSDP, Ray) and model versioning/registries (MLflow)
- Vector search (pgvector, Milvus, Pinecone, Weaviate) and retrieval quality tuning (see the sketch after this list)
- Experience with evaluation frameworks (Ragas/DeepEval) and observability (OpenTelemetry, Langfuse, Prometheus/Grafana)
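As a hypothetical example of the vector-search bullet, here is a pgvector similarity query sketch. The `items` table, its `embedding vector(384)` column, and the connection string are assumptions for illustration only.

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # placeholder DSN
query_embedding = [0.1] * 384                   # stand-in for a real embedding

with conn.cursor() as cur:
    # `<=>` is pgvector's cosine-distance operator; smaller = more similar.
    cur.execute(
        "SELECT id, content FROM items "
        "ORDER BY embedding <=> %s::vector LIMIT 5",
        ("[" + ",".join(map(str, query_embedding)) + "]",),
    )
    for item_id, content in cur.fetchall():
        print(item_id, content)
```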
Responsibilities (Contractor)
LLM Engineering
- Build, refactor, and productionize LLM inference modules
- Maintain and evolve API endpoints for AI services
- Migrate/deploy models across cloud providers; manage scaling/rollbacks
- Support training, memory systems, and semantic search integrations
AI Systems & Infrastructure
- Design and implement robust AI pipelines (evals, telemetry, fine-tuning, data curation)
- Stand up end-to-end observability and evaluation with clear SLOs
- Own performance: profiling, caching, batching, speculative decoding, paged attention
Why This Work Matters
Your work makes our AI features reliable, scalable, and measurable by:
- Enabling multi-cloud deployment for flexibility and cost control
- Improving output quality via guardrails, observability, and systematic evaluation
- Powering personalization with solid training pipelines and prompt management
- Providing business-critical APIs that unify AI/ML functionality across the product
Our Typical Stack
Python • PyTorch • Hugging Face (Transformers, Diffusers, Accelerate, PEFT) • vLLM or TGI • ONNX Runtime • TensorRT-LLM • Triton Inference Server • CUDA/FlashAttention • bitsandbytes/quantization • Ray/Prefect/Airflow • MLflow/Weights & Biases • Postgres + pgvector/Milvus/Pinecone • Redis • Kafka/PubSub • OpenTelemetry • Prometheus/Grafana • Langfuse
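For a concrete taste of the serving side of this stack, here is a minimal vLLM offline-inference sketch; the checkpoint and sampling settings are placeholders. vLLM applies continuous batching and paged attention internally, which is the kind of throughput work the performance responsibilities above describe.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder checkpoint
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)

prompts = [
    "Explain KV caching in one paragraph.",
    "Write a haiku about GPUs.",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```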
Engagement Details
- Contract (hourly or milestone-based)
- Remote; 3–4 hours overlap with US Eastern Time preferred
- Start: immediate
How to Apply (please include)
1. Links to 1–3 shipped projects or repos showing production ML work (not just notebooks)
2. A short note on how you cut inference latency or scaled throughput—be specific (numbers, tools, changes)
3. Your experience with fine-tuning (methods, data prep, evals)
4. Your hourly rate and earliest start date
Skills/Tags
Machine Learning, Deep Learning, Large Language Models (LLMs), Generative AI, PyTorch, Hugging Face, Prompt Engineering, RAG, Computer Vision, Diffusers/SDXL, ONNX Runtime, TensorRT, Triton Inference Server, MLOps, Model Serving, CUDA, Python, Docker, Kubernetes, Observability, MLflow, Langfuse, Vector Databases, Ray, Airflow/Prefect