Confidential
Cloud Engineer
Full Time · Remote · $145k–$200k
Apply to this role (takes ~2 minutes). Consent-first: your data, your control.
About Us
We are a staffing services technology company that helps organizations design, build, and scale digital products and engineering capabilities. Our teams deliver end-to-end software development, engineering, and design services, and we provide flexible staffing solutions to augment internal teams with specialized talent—quickly and reliably.
The Role
We are seeking an innovative and resilient Cloud Engineer to join our distributed engineering team. This role focuses on designing, building, deploying, and operating scalable AI/ML infrastructure that enables product teams to prototype, train, and serve models with reliability and efficiency. You’ll bridge data science, backend engineering, and platform operations to ensure robust, observable, and cost-effective AI systems in production.
What You’ll Do
Cloud Architecture & Infra Design: Design and implement scalable, secure cloud architectures for AI/ML workloads across multiple environments (dev, staging, prod). Architect data pipelines, model training fleets, and model serving endpoints, and author incident response playbooks.
Platform & Automation: Build reusable platform components (CI/CD for ML, feature stores, model registry, experiment tracking, reusable pipelines) and automate deployment, scaling, and self-healing of AI services.
Model Deployment & Operations: Provision GPU/CPU clusters, manage containerized services (Docker/Kubernetes), implement inference caching, autoscaling, and canary/blue-green deployment strategies; monitor service health and model performance in production.
Observability & Governance: Instrument comprehensive monitoring, tracing, logging, and alerting; establish SLAs/SLOs for latency, availability, and model quality; implement cost controls and usage dashboards.
Collaboration & Delivery: Work closely with Data Scientists, ML Engineers, Backend Engineers, and DevOps in an Agile environment to operationalize experiments, standardize APIs, and maintain clear documentation.
Security & Compliance: Implement secure coding and deployment practices; manage IAM, encryption at rest/in transit, secret management, and compliance considerations for regulated data environments when applicable.
What We’re Looking For
Experience: 3+ years in cloud engineering, DevOps, or MLOps with production-grade systems; experience supporting AI/ML workloads is a plus.
Education: Bachelor’s or Master’s degree in Computer Science, Electrical/Computer Engineering, Mathematics, or a related field (or equivalent practical experience).
Cloud & Infra Proficiency: Strong hands-on experience with at least one major cloud provider (AWS, Azure, or GCP); familiarity with Kubernetes, containerization, and cloud-native services for compute, storage, and networking.
ML Infrastructure: Experience with ML lifecycle tooling (MLflow, Kubeflow, Weights & Biases, or equivalent) and feature stores/ML metadata management concepts; comfort with model serving frameworks and GPUs.
Automation & CI/CD: Proficient in CI/CD for data/ML workloads, IaC (Terraform, CloudFormation, ARM templates), Git workflows, and configuration management.
Programming & SRE Practices: Proficiency in Python or another language commonly used in MLOps; strong understanding of software engineering best practices (testing, code reviews, documentation).
Observability: Familiarity with monitoring/observability stacks (Prometheus, Grafana, OpenTelemetry, Cloud logging/monitoring services); ability to define and track SLOs/SLIs.
Communication: Clear written and verbal communication; ability to translate technical concepts for non-technical stakeholders.
Remote/Collaboration: Comfortable working asynchronously in a distributed team; self-motivated and capable of prioritizing tasks in a dynamic environment.
Adaptability: Comfortable handling rapid changes in priorities, diagnosing issues across distributed systems, and turning incidents into learnings.
Bonus Points
ML/AI Platform Experience: Hands-on with ML model training pipelines, distributed training, or serving architectures; experience with RAG, vector databases, or LLM inference at scale.
GPU & GPU Orchestration: Experience managing GPU clusters, job schedulers, and cost-optimized GPU usage.
Data Compliance: Familiarity with HIPAA, SOC 2, GDPR, or other regulatory frameworks; implementing differential privacy or federated learning considerations.
Industry Context: Experience deploying AI solutions in FinTech, Healthcare, E-commerce, or SaaS domains.
Certifications: Cloud provider certifications (e.g., AWS Certified DevOps Engineer – Professional, Microsoft DevOps Engineer Expert, Google Professional Cloud DevOps Engineer).
Compensation & Benefits
We believe in paying top-of-market rates for top-tier talent. The base salary range for this role is $145,000 to $200,000, with exact placement determined by your skills, years of experience, and interview performance.
Additional Benefits:
Equity: Competitive stock option package.
Remote Setup: Home office stipend to get your workspace set up perfectly.
Health: Comprehensive medical, dental, and vision insurance.
Time Off: Flexible PTO policy + Company Holidays.
Growth: Annual learning and development budget.
Retirement: 401(k) matching plan.
How your application is handled
When you apply, your resume and cover letter are uploaded and structured by Claude into hiring-relevant fields. You see every consent toggle individually and can withdraw at any time. We do not sell your data, do not share it with third-party brokers, and do not infer protected-class attributes. Read more on our privacy page.