Senior Cloud Infrastructure Engineer - VideoTech Experienced

United States, California, San Francisco

$ 175 - 250 K/year

Full-time

Posted on: 18 days ago

Skills

python
grafana
kubernetes
Loki
Thanos
prometheus
terraform
linux
GenerativeAI
llm

Senior Cloud Infrastructure Engineer

Full-Time | On-site | San Francisco
Compensation: $175K – $250K + Competitive Equity
Experience: 5–12 Years


About the Role :

We are looking for a Senior Cloud Infrastructure Engineer who thrives in fast-paced environments and excels at building and scaling large GPU compute platforms. You will play a crucial role in architecting, developing, and operating the foundational infrastructure that powers advanced AI workloads.

This role requires someone deeply technical, adaptable, and execution-oriented—more focused on solving hard problems than matching exact tools.


What You’ll Do :

  • Build and maintain the core Python-based platform that handles request routing, AI workload orchestration, GPU server capacity management, observability, and more.

  • Develop and manage infrastructure using Terraform, Ansible, and cloud provider APIs, supporting GPU fleets across cloud and potentially bare-metal environments.

  • Own and operate the platform’s foundational technologies, which may include:
    Kubernetes (K8s), FluxCD, Nomad, Prometheus, Thanos, Grafana, Loki, distributed networking, and storage systems.

  • Architect and implement solutions that significantly improve the performance, scalability, and availability of services used by millions of users.

  • Collaborate closely with engineering teams to design and build new infrastructure systems end-to-end.

  • Drive the long-term infrastructure roadmap (1-, 2-, and 5-year planning) and influence best practices as the company scales.

  • Shape the technical direction of a highly ambitious engineering environment.

What We’re Looking For :

  • 5–12 years of experience as an Infrastructure Engineer, Cloud Engineer, SRE, or similar role.

  • Strong experience in Python, Linux, cloud platforms (AWS preferred), Kubernetes, and distributed systems.

  • Hands-on experience with IaC tools such as Terraform and automation tools like Ansible.

  • Experience with monitoring/observability stacks: Prometheus, Loki, Grafana, Thanos, etc.

  • Strong problem-solving ability, ownership mindset, and a bias for rapid execution.

  • Ability to work in a small, high-performance team solving complex infrastructure challenges.

  • Willingness to relocate to San Francisco. Remote work is possible, but in-person collaboration is preferred.

Tech Stack :

Python, Kubernetes, Terraform, Ansible, AWS, Prometheus, Grafana, Loki, Thanos, Linux


Interview Process :


  1. Recruiter Screen

  2. Introductory Call with Leadership

  3. Technical Phone Interview

  4. Additional Leadership Conversation

  5. Onsite Technical Interview

  6. Take-Home Project

  7. Reference Checks