
Site Reliability Engineer (SRE)

We are hiring a Site Reliability Engineer to ensure our production systems are reliable, performant, and scalable. You will define and enforce service level objectives, build observability into every layer of our stack, and drive a culture of operational excellence through automation and blameless post-mortems. This role bridges software engineering and operations with a strong bias toward eliminating toil through code.

Key Responsibilities

  • Define, track, and enforce SLOs/SLIs/SLAs across production services and work with teams to meet reliability targets
  • Build and maintain observability infrastructure — metrics, logs, traces, and dashboards
  • Lead incident response during production outages, coordinate cross-team communication, and author post-mortems
  • Identify and eliminate toil through automation, self-healing systems, and better tooling
  • Conduct capacity planning and performance modeling to ensure systems scale with traffic growth
  • Design and implement chaos engineering experiments to proactively find system weaknesses
  • Collaborate with product engineering teams on architecture reviews focused on reliability and fault tolerance

Required Skills & Experience

  • 4+ years of experience in SRE, DevOps, or Infrastructure Engineering roles
  • Strong software engineering skills in at least one language (Python, Go, or Java)
  • Deep understanding of distributed systems concepts (consistency, availability, partition tolerance)
  • Experience defining and operating against SLOs/SLIs and error budgets
  • Hands-on experience with observability platforms (Datadog, New Relic, Grafana, or Splunk)
  • Proficiency with cloud platforms (AWS, GCP, or Azure) and container orchestration (Kubernetes)
  • Experience with incident management processes and tools (PagerDuty, Opsgenie, incident.io)
  • Strong understanding of Linux systems internals and networking

Nice-to-Have

  • Experience with chaos engineering tools (Gremlin, Litmus, Chaos Monkey)
  • Background in performance engineering or load testing (k6, Locust, Gatling)
  • Familiarity with eBPF-based observability tools
  • Experience writing internal SRE runbooks and on-call documentation
  • Contributions to open-source reliability or observability projects

Tech Stack

Kubernetes, Prometheus, Grafana, Datadog, PagerDuty, Terraform, Go, Python, AWS, OpenTelemetry

What We Offer

  • Competitive salary and equity at [Company Name]
  • On-call compensation and generous incident recovery time
  • Annual budget for certifications, conferences, and learning
  • Flexible work arrangements — remote or hybrid
  • Comprehensive health benefits and 401(k) matching
  • Quarterly team offsites and company-wide hackathons

Interview Process

  1. Recruiter phone screen (30 min)
  2. Technical conversation with an SRE team member: discuss past incidents and reliability work (45 min)
  3. Hands-on troubleshooting exercise: diagnose a simulated production issue using logs, metrics, and traces (60 min)
  4. System design round: design a highly available architecture with defined SLOs (60 min)
  5. Leadership and culture fit conversation with Engineering Manager (30 min)
  6. Optional: meet the team over a casual coffee chat