Job Description
Description
Senior Observability & Performance Engineer (Node.js & Java)
Our team is responsible for ensuring the reliability, performance, and scalability of our critical applications. We operate in a dynamic public cloud environment and are constantly striving to improve the visibility and health of our systems.
The Opportunity
We are seeking a highly skilled and proactive Senior Observability & Performance Engineer to join our team. In this role, you will be instrumental in gaining deep insights into our existing Node.js and Java-based microservices, understanding their current instrumentation, and driving initiatives to measure and optimize their performance. You will play a critical role in evolving our observability practices to proactively identify bottlenecks, improve system efficiency, and enhance the overall user experience.
What you will do
- Deep Dive into Existing Codebases: Jump into existing Node.js and Java applications to understand how metrics, logs, and traces are currently being generated and consumed.
- Evaluate & Enhance Instrumentation: Assess the quality and completeness of existing observability data. Identify gaps in instrumentation and implement improvements to capture crucial performance metrics and contextual information (logs, traces).
- Define & Implement Performance Metrics: Collaborate with development teams to define key performance indicators (KPIs), service level indicators (SLIs), and service level objectives (SLOs) for our applications and services.
- Establish Performance Baselines & Monitoring: Implement robust monitoring and alerting solutions using tools like InfluxDB and Prometheus to track defined metrics, identify deviations from baselines, and proactively detect performance degradations.
- Performance Analysis & Root Cause Identification: Analyze performance data to identify bottlenecks, diagnose issues, and pinpoint the root cause of performance problems in distributed systems.
- Capacity Planning & Optimization: Utilize performance insights to assist with capacity planning and recommend architectural or code changes for performance optimization and resource efficiency.
- Troubleshooting & Incident Response: Support incident response by leveraging observability tools to quickly identify and troubleshoot production issues related to performance and reliability.
- Collaboration & Knowledge Sharing: Work closely with Node.js and Java development teams to evangelize observability best practices, provide guidance on effective instrumentation, and foster a culture of performance-aware development.
- Tooling & Automation: Contribute to the development and maintenance of observability tools and automation that streamline data collection, analysis, and visualization.
- Continuous Improvement: Continuously research and evaluate new observability patterns, tools, and technologies to enhance our monitoring capabilities.
What you will bring
- Proven experience as an Observability Engineer, Performance Engineer, or SRE with a strong focus on system performance and monitoring.
- Expertise with Node.js and Java application ecosystems, including understanding their runtime characteristics, common performance pitfalls, and best practices for instrumentation.
- Strong hands-on experience with observability platforms and tools, specifically:
  - InfluxDB: For time-series data storage and querying.
  - Prometheus: For metrics collection and alerting.
- Familiarity with other tools like Grafana (for visualization), distributed tracing solutions, and log management systems (e.g., ELK Stack) is highly desirable.
- Solid understanding of performance testing methodologies (load testing, stress testing, scalability testing).
- Experience working with public cloud infrastructure (AWS, Azure, GCP, etc.) and cloud-native architectures (microservices, containers).
- Familiarity with Kubernetes and container orchestration.
- Ability to read and understand code to identify performance-related areas for improvement.
- Excellent analytical and problem-solving skills with a data-driven approach.
- Strong communication and collaboration skills, with the ability to work effectively with development and operations teams.
- Proactive mindset with a passion for optimizing system performance and reliability.
Education & Experience
- Bachelor’s degree in Computer Science, Software Engineering, or a related technical field.
- 5+ years of development experience (Node.js and Java)
- 3+ years of progressive experience in roles such as Observability Engineer, Performance Engineer, Site Reliability Engineer (SRE), or a similar capacity with a dedicated focus on system performance, monitoring, and reliability.
- Demonstrated experience with deep dives into existing codebases (Node.js and Java), evaluating and enhancing instrumentation, and defining/implementing performance metrics.
- Proven history of implementing and utilizing observability platforms and tools like InfluxDB and Prometheus in production environments
- Shell scripting experience
- Understanding of Linux operating systems
- Some AWS experience (Storage, Compute, Networking)
- Strong troubleshooting skills
Bonus points if you have
- Experience with Infrastructure as Code (IaC) tools (Terraform, Salt).
- Experience with large cloud projects
- AWS-specific knowledge of EC2, S3, VPC, Classic ELB/NLB/ALB, Lambda, CloudWatch, Transit Gateway
- Experience with chaos engineering principles
- Contributions to open-source observability projects.
Supervisory Responsibility
This role will not have any supervisory requirements.
This job operates in a professional office environment. This role routinely uses standard office equipment such as laptop computers, photocopiers and smartphones.
Physical Demands
This role requires extended periods of sitting or standing at a computer workstation.
Position Type/Expected Hours of Work
This is a full-time position. Days and hours of work are Monday through Friday, during normal business hours. This position also participates in an on-call rotation consisting of 2 weeks as primary and 2 weeks as secondary, providing 24/7 support for the platform during these rotations. Typically, this amounts to 4 out of every 8 weeks.
Travel
Travel requirement is less than 5%, and may vary based on business needs.
Other Duties
Please note this job description is not designed to cover or contain a comprehensive listing of activities, duties or responsibilities that are required of the employee for this job. Duties, responsibilities and activities may change at any time with or without notice.
EEO Statement
Paymentus is an equal opportunity employer. We enthusiastically accept our responsibility to make employment decisions without regard to race, religious creed, color, age, sex, sexual orientation, national origin, ancestry, citizenship status, religion, marital status, disability, military service or veteran status, genetic information, medical condition including medical characteristics, or any other classification protected by applicable federal, state, and local laws and ordinances. Our management is dedicated to ensuring the fulfillment of this policy with respect to hiring, placement, promotion,
Company
Paymentus
Location
Richmond Hill
Country
Canada
Salary
125.000
URL