Site Reliability Engineer
ABOUT LINGO
Lingo is building a cutting‑edge digital health platform that fuses continuous biosensor data, high‑performance backend engineering, and advanced analytics to help people live healthier, longer, fuller lives. Our systems process massive volumes of real‑time data, and maintaining the reliability, scalability, and security of our platform is mission‑critical to delivering value to our users.
THE OPPORTUNITY
We are looking for a Site Reliability Engineer (SRE) to join our Platform team and ensure Lingo’s biosensor platform runs reliably and efficiently at scale. You will be a key partner for Backend, Data, and Mobile teams, driving improvements across infrastructure, observability, incident management, and automation. Your goal is to enable high-velocity development with confidence, maintain multi-region uptime, and embed reliability practices across engineering. You’ll work in production Kubernetes environments, tune service meshes, evolve operational playbooks, and proactively prevent incidents through code, automation, and design.
WHAT YOU’LL DO
Establish and improve SLOs, SLIs, and SLAs across services; partner with engineering teams to embed reliability targets into product designs.
Build and evolve monitoring, alerting, and tracing systems to ensure rapid detection and resolution of issues.
Develop incident response processes, on-call rotations, and postmortem practices that drive continuous improvement.
Implement automation for deployment pipelines, failover, scaling, and capacity planning to reduce manual operations and error risk.
Champion security and compliance-driven infrastructure, including secrets management, secure networking, and audit readiness.
Collaborate on disaster recovery strategies and resilience testing (chaos engineering, load testing, rolling updates, blue/green deployments).
Partner with developers to identify performance bottlenecks, optimize services, and reduce infrastructure costs.
Contribute to internal tooling and developer experience to accelerate safe delivery of features in production.
REQUIRED QUALIFICATIONS
5+ years in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles for distributed systems at scale.
Deep expertise with Kubernetes, container orchestration, and service meshes in production environments.
Strong skills in observability tooling (Prometheus, Grafana, OpenTelemetry, etc.) and incident management systems.
Experience designing HA/DR architectures, managing multi-region deployments, and optimizing for low-latency traffic flows.
Proficiency with cloud platforms (AWS/GCP/Azure) and infrastructure-as-code (Terraform, Helm).
Security and compliance mindset, comfortable with regulated environments (HIPAA/GDPR) and auditing requirements.
Excellent cross-functional communication and collaboration skills.
PREFERRED QUALIFICATIONS
Experience with streaming/messaging systems (Kafka, RabbitMQ) in production.
Background in digital health, IoT, or other mission-critical data platforms.
Familiarity with chaos engineering tools and cost-optimization strategies for global cloud services.
Development experience in a modern backend language (Java, Kotlin, Go, Python) for tooling and automation.
LINGO CULTURE
Customer-first, reliability-obsessed, and team-oriented. At Lingo, SREs are guardians of uptime, performance, and developer velocity. You’ll help us move fast without compromising trust or quality.
The base pay for this position is N/A. In specific locations, the pay range may vary from the range posted.