Senior Engineer - Site Reliability
I. BASIC PURPOSE / JOB BRIEF:
The Senior Engineer – Site Reliability is responsible for ensuring the reliability, performance, and availability of applications, services, and underlying infrastructure by employing monitoring and observability solutions and by creating and maintaining automation scripts to ensure optimum levels across the entire technology stack.
II. MAJOR RESPONSIBILITIES AND DUTIES:
- Configure and maintain the enterprise monitoring tool to provide real-time visibility into the state of health across the technology stack
- Design and create dashboards that provide multi-level views based on functional requirements, such as executive and tactical views
- Create and maintain key thresholds across all monitoring elements to ensure proactive, early detection of impending incidents or problems
- Analyze events and correlate them across all observability and monitoring tools to capture trends and behavior patterns that support proactive courses of action
- Design, develop, and utilize automation tools and scripts to address repetitive actions and, where possible, create corrective courses of action to prevent and/or shorten prolonged outages
- Work closely with the operations team during incident and problem management to respond quickly to issues identified using the monitoring tools
- Regularly review and optimize infrastructure performance using logs, metrics, and traces as part of continuous improvement, adjusting thresholds and monitoring requirements as the environment constantly changes
- Develop and maintain a robust alerting strategy, including integration with on-call tools to ensure timely escalation and resolution of critical issues.
- Implement and manage end-to-end event lifecycle processes to ensure accurate incident detection and efficient response.
III. JOB SPECIFICATIONS:
Educational Requirement:
- Bachelor’s degree in Computer Science, Information Technology, or a related field; or equivalent work experience.
Experience Requirement:
- 2–5+ years of extensive experience as a systems and network administrator
- Hands-on experience managing monitoring tools such as, but not limited to, SolarWinds and Nagios
- Evident understanding of what observability is and what it does
Skills and Attributes:
- Proficient with major cloud platforms such as AWS, GCP, Azure and Alibaba Cloud
- Hands-on experience with SNMP-based monitoring tools such as SolarWinds, Nagios, and Checkmk
- Good grasp of observability platforms such as Splunk and Dynatrace
- Experience with containerization platforms such as Docker and Kubernetes
- Extensive experience with virtualization technologies such as VMware
- Strong knowledge of networking using collapsed architecture or similar enterprise networking technology
- Knowledgeable in scripting languages such as Python, Bash, or PowerShell.
- AWS Certified Solutions Architect, Azure Solutions Architect, or equivalent certification.
- Certified Kubernetes Administrator (CKA)
- Solid understanding of disaster recovery and business continuity practices.
Other Qualifications:
- Strong analytical skills to identify, troubleshoot, and resolve complex technical issues.
- Excellent verbal and written communication skills for interacting with team members, stakeholders, and end-users. Ability to explain technical concepts to non-technical audiences.
- Ability to work effectively in a team environment and collaborate with other IT groups
- Effective prioritization and management of multiple tasks and projects.
- Flexibility to adapt to changing technologies, tools, and business requirements.
- Proactive in identifying areas for improvement and suggesting enhancements.
- Ability to train junior team members
- Ability to work under pressure and remain decisive