Site Reliability Engineer (SRE)

We are hiring a Site Reliability Engineer to ensure our production systems are reliable, performant, and scalable. You will define and enforce service level objectives, build observability into every layer of our stack, and drive a culture of operational excellence through automation and blameless post-mortems. This role bridges software engineering and operations with a strong bias toward eliminating toil through code.

Key Responsibilities

Define, track, and enforce SLOs/SLIs/SLAs across production services and work with teams to meet reliability targets
Build and maintain observability infrastructure — metrics, logs, traces, and dashboards
Lead incident response during production outages, coordinate cross-team communication, and author post-mortems
Identify and eliminate toil through automation, self-healing systems, and better tooling
Conduct capacity planning and performance modeling to ensure systems scale with traffic growth
Design and implement chaos engineering experiments to proactively find system weaknesses
Collaborate with product engineering teams on architecture reviews focused on reliability and fault tolerance

Required Skills & Experience

4+ years of experience in SRE, DevOps, or Infrastructure Engineering roles
Strong software engineering skills in at least one language (Python, Go, or Java)
Deep understanding of distributed systems concepts (consistency, availability, partition tolerance)
Experience defining and operating against SLOs/SLIs and error budgets
Hands-on experience with observability platforms (Datadog, New Relic, Grafana, or Splunk)
Proficiency with cloud platforms (AWS, GCP, or Azure) and container orchestration (Kubernetes)
Experience with incident management processes and tools (PagerDuty, Opsgenie, incident.io)
Strong understanding of Linux systems internals and networking

Nice-to-Have

Experience with chaos engineering tools (Gremlin, Litmus, Chaos Monkey)
Background in performance engineering or load testing (k6, Locust, Gatling)
Familiarity with eBPF-based observability tools
Experience writing internal SRE runbooks and on-call documentation
Contributions to open-source reliability or observability projects

Tech Stack

Kubernetes, Prometheus, Grafana, Datadog, PagerDuty, Terraform, Go, Python, AWS, OpenTelemetry

What We Offer

Competitive salary and equity at [Company Name]
On-call compensation and generous incident recovery time
Annual budget for certifications, conferences, and learning
Flexible work arrangements — remote or hybrid
Comprehensive health benefits and 401(k) matching
Quarterly team offsites and company-wide hackathons

Interview Process

1. Recruiter phone screen (30 min)
2. Technical conversation with an SRE team member: discuss past incidents and reliability work (45 min)
3. Hands-on troubleshooting exercise: diagnose a simulated production issue using logs, metrics, and traces (60 min)
4. System design round: design a highly available architecture with defined SLOs (60 min)
5. Leadership and culture fit conversation with Engineering Manager (30 min)
6. Optional: meet the team over a casual coffee chat

Common Screening Mistakes

Confusing SRE with traditional sysadmin or NOC roles — SRE is a software engineering discipline focused on reliability, not a monitoring-and-ticket-closing job
Not asking about SLOs and error budgets — these are the defining concepts of SRE; candidates who cannot discuss them likely have not worked in a true SRE model
Requiring exact tool-stack matches — an SRE who built observability with Prometheus can easily learn Datadog; focus on reliability principles
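
For screeners less familiar with the error budget idea mentioned above, the arithmetic behind it can be sketched in a few lines. This is a minimal illustration, assuming a simple availability SLO measured over a 30-day rolling window. The function names and the 30-day window are hypothetical choices for this example, not a standard API, and real teams often compute budgets from request success ratios rather than downtime minutes.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    slo_target is a fraction, e.g. 0.999 for "99.9% available".
    The 30-day default window is an assumption made for this sketch.
    """
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent.

    A negative result means the SLO has been breached for this window.
    """
    budget = error_budget_minutes(slo_target, window_days)
    return 1.0 - downtime_minutes / budget


if __name__ == "__main__":
    budget = error_budget_minutes(0.999)
    print(f"99.9% over 30 days allows {budget:.1f} minutes of downtime")
    remaining = budget_remaining(0.999, downtime_minutes=30)
    print(f"After a 30-minute outage, {remaining:.0%} of the budget remains")
```

Under these assumptions, a 99.9% target leaves about 43.2 minutes of allowed downtime per 30 days. A candidate who has worked in an SRE model should be able to reason about this kind of trade-off without prompting, for example by explaining what their team did when the budget ran low.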

Red Flags

Cannot describe an incident they managed and what they learned from the post-mortem — incident response is the core of SRE work
Focuses only on monitoring dashboards but cannot explain what the metrics mean or how they relate to user experience
Shows a 'blame culture' mindset when discussing past outages instead of blameless post-mortem thinking

What Good Looks Like

A strong SRE candidate speaks in terms of SLOs and error budgets, can walk you through a major incident they handled from detection to resolution to post-mortem, and has specific examples of eliminating toil through automation (e.g., 'automated failover reduced recovery time from 30 minutes to under 2 minutes').

Key Tech to Listen For

SLOs/SLIs
Kubernetes
Observability
Incident Response
Chaos Engineering
OpenTelemetry