Description & Requirements
A Day in the Life Typically Includes:
- Monitor application operational performance and reliability indicators. Perform basic analysis to find the cause of issues, notify the relevant teams, and drive resolution.
- Identify and publish the metrics required to ensure visibility into all aspects of our application’s performance and stability from both a global and a customer perspective.
- Define SLOs specific to services/products/customers.
- Build dashboards and alerts in Splunk and other monitoring tools that continuously monitor the identified metrics and SLOs and report violators.
- Provide and document root cause analysis (RCA) for production outages and maintenance, and propose automation solutions that reduce downtime.
- Develop automation code for infrastructure needs, testing solutions, failover solutions, failure mitigation, and related tasks.
- Proactively respond to alerts and incidents, and keep alerts and dashboards up to date.
- Work independently and within a team to triage production outages and maintenance issues and work toward their remediation.
Basic Qualifications:
- Bachelor's degree or higher in Computer Science or a related engineering discipline.
- Minimum 1 year of total industry experience.
- Minimum 6 months of experience building dashboards, reports, and alerts in Splunk or a similar monitoring tool.
- Minimum 3 months of experience with AWS.
- Hands-on experience with a scripting language such as Python.
- Good oral and written communication skills.
Preferred Qualifications:
- Hands-on experience with Linux/Windows fundamentals, shell scripting, hardware performance tuning and scalability, and mitigating networking and security issues.
- Experience working with RDBMS (MSSQL, DB2), NoSQL, caching, and queuing, plus knowledge of firewalls and load balancers.
Overview
The SRE team is looking for a dedicated SRE engineer who is ready to take on challenges that combine troubleshooting, monitoring, code development, automation, networking, and tuning across different layers of the application.
We are looking for engineers who can analyze problems, provide RCA, catch issues before they appear, and develop or propose automated solutions for monitoring, performance, capacity planning, high availability, and disaster response for on-prem and AWS applications.
The team provides a unique opportunity to gain knowledge of the Infor Nexus platform and its various products and modules, along with exposure to various teams and services (app servers, queueing, caching, and data services). This is a great opportunity to expand your technical horizons across different services and layers of the application, regardless of prior experience with those services or products.