Job Description
Role Overview
We are seeking an experienced and proactive Observability Lead to take ownership of the visibility, reliability, and performance monitoring of all production systems across the organisation.
This role is responsible for ensuring that infrastructure, applications, databases, and critical services are fully monitored in real time, enabling early issue detection, rapid incident response, and continuous service improvement. The ideal candidate will build a strong observability culture by implementing best-in-class monitoring, alerting, logging, and performance management practices.
You will work closely with Engineering, DevOps, Security, Product, and Support teams to maintain highly available and resilient systems in a fast-paced fintech environment.
Responsibilities
1. Observability Strategy & Ownership
- Develop and lead the company-wide observability strategy across infrastructure, applications, cloud environments, databases, and internal services.
- Establish monitoring standards, frameworks, and governance for all production workloads.
- Ensure real-time visibility into system health, performance, availability, and capacity.
- Build a proactive reliability culture through data-driven monitoring practices.
2. Monitoring & Alerting Management
- Ensure 100% monitoring coverage across all critical production services.
- Design, configure, and maintain dashboards, alerts, logs, metrics, and distributed tracing systems.
- Continuously tune alert thresholds to reduce noise and minimise false positives.
- Maintain centralised monitoring systems accessible to relevant teams.
3. Incident Detection & Operational Response
- Ensure incidents are detected internally before customer impact whenever possible.
- Lead operational response during outages, degradations, and system anomalies.
- Coordinate cross-functional teams during incident resolution.
- Drive post-incident reviews, root cause analysis (RCA), and corrective action plans.
4. Performance Monitoring & Optimisation
- Track system latency, throughput, resource utilisation, and application performance metrics.
- Identify performance bottlenecks and collaborate with engineering teams on remediation.
- Support load readiness, scaling decisions, and capacity planning.
- Improve platform stability and service responsiveness over time.
5. Reporting & Insights
- Produce weekly and monthly reports on system health, uptime, incident trends, and risk areas.
- Provide executive dashboards for leadership visibility into platform performance.
- Use operational data to recommend improvements and investment priorities.
6. Collaboration & Leadership
- Partner with Engineering, DevOps, Security, and Product teams to embed observability into all deployments.
- Support teams with troubleshooting, diagnostics, and production readiness reviews.
- Mentor engineers on monitoring best practices and observability tooling.
- Act as the subject matter expert for reliability monitoring and operational intelligence.
Requirements
Education & Experience
- Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related field.
- 5+ years of experience in Observability, Site Reliability Engineering (SRE), DevOps, Infrastructure Monitoring, or Production Operations.
- Experience in fintech, payments, telecom, banking, or other mission-critical environments is preferred.
Technical Skills
- Hands-on experience with observability tools such as Grafana, Prometheus, Datadog, New Relic, SigNoz, ELK Stack, Splunk, AppDynamics, or similar.
- Strong understanding of metrics, logs, traces, and alerting systems.
- Experience with Linux servers, cloud platforms (AWS, Azure, GCP), and container environments.
- Knowledge of networking, databases, APIs, and distributed systems.
- Scripting skills in Python, Bash, or similar languages are an advantage.
Soft Skills
- Strong analytical and troubleshooting ability.
- Calm under pressure during incidents and outages.
- Strong communication and stakeholder management skills.
- Leadership mindset with ownership and accountability.
- Close attention to detail and a continuous improvement focus.
What Success Looks Like in This Role
- Production issues are identified before customers experience disruption.
- Leadership has real-time confidence in platform health and uptime.
- Engineers rely on strong dashboards and actionable alerts.
- System performance continuously improves through data-driven action.
- Downtime and recurring incidents decrease significantly over time.