Senior Engineer - Site Reliability

Business Unit: Shared Services

Division: Information Technology

I. BASIC PURPOSE / JOB BRIEF:

The Senior Engineer - Site Reliability is responsible for ensuring the reliability, performance, and availability of applications, services, and underlying infrastructure. The role does this by employing monitoring and observability solutions and by creating and maintaining automation scripts, so that the entire technology stack operates at an optimal level.

II. MAJOR RESPONSIBILITIES AND DUTIES:

Configure and maintain the enterprise monitoring tool to provide real-time visibility into the state of health across the technology stack.
Design and create dashboards that provide multi-level views based on functional requirements, such as executive and tactical views.
Create and maintain key thresholds across all monitoring elements to ensure proactive, early detection of impending incidents or problems.
Analyze events and correlate them across all observability and monitoring tools to capture trends and behavior patterns that support proactive courses of action.
Design, develop, and use automation tools and scripts to address repetitive actions and, where possible, create corrective actions to prevent or shorten prolonged outages.
Work closely with the operations team during incident and problem management to enable quick response to issues identified through the monitoring tools.
Regularly review and optimize infrastructure performance using logs, metrics, and traces as part of continuous improvement, adjusting thresholds and monitoring requirements as the environment changes.
Develop and maintain a robust alerting strategy, including integration with on-call tools to ensure timely escalation and resolution of critical issues.
Implement and manage end-to-end event lifecycle processes to ensure accurate incident detection and efficient response.

III. JOB SPECIFICATIONS:

Educational Requirement:

Bachelor’s degree in Computer Science, Information Technology, or a related field; or equivalent work experience.

Experience Requirement:

2–5+ years of extensive experience as a systems and network administrator.
Hands-on experience managing monitoring tools such as, but not limited to, SolarWinds and Nagios.
Demonstrated understanding of what observability is and what it does.

Skills and Attributes:

Proficient with major cloud platforms such as AWS, GCP, Azure, and Alibaba Cloud.
Hands-on experience with SNMP-based monitoring tools such as SolarWinds, Nagios, and Checkmk.
Good grasp of observability platforms such as Splunk and Dynatrace.
Experience with containerization platforms such as Docker and Kubernetes.
Extensive experience with virtualization technology such as VMware.
Strong knowledge of networking using collapsed architecture or similar enterprise networking technology.
Knowledgeable in scripting languages such as Python, Bash, or PowerShell.
AWS Certified Solutions Architect, Azure Solutions Architect, or equivalent certification.
Certified Kubernetes Administrator (CKA).
Solid understanding of disaster recovery and business continuity practices.

Other Qualifications:

Strong analytical skills to identify, troubleshoot, and resolve complex technical issues.
Excellent verbal and written communication skills for interacting with team members, stakeholders, and end-users. Ability to explain technical concepts to non-technical audiences.
Ability to work effectively in a team environment and collaborate with other IT groups.
Effective prioritization and management of multiple tasks and projects.
Flexibility to adapt to changing technologies, tools, and business requirements.
Proactive in identifying areas for improvement and suggesting enhancements.
Ability to train junior team members.
Ability to work under pressure and remain decisive.

Be A Game Changer
Join Okada Manila, Where Passion Meets Fun!
