Job Description
Summary
Description
- Configure and monitor on-prem and cloud-based dependencies
- Automate continuous integration (CI) and continuous delivery (CD) pipelines
- Maintain staging and production environments with the goal of maximizing uptime
- Implement observability of systems for monitoring, alerting, and metrics reporting
- Generate reports regarding service metrics on performance, availability, and reliability
- Champion practices regarding change control management and incident response
A successful Atlassian Services Site Reliability Engineer will be expected to:
- Proactively communicate the status of Atlassian services to stakeholders and follow through on time-sensitive tasks
- Demonstrate willingness to ask for clarification and increase awareness of the larger context
- Explore solutions to problems, evaluate risk vs. reward, then execute the best approach
- Communicate asynchronously with a global team across multiple timezones
- Document new processes or update existing documentation pages
- Show eagerness and curiosity to learn across multiple technology stacks
Minimum Qualifications
- B.S. in Computer Science or related work experience
- Passion for building reliable, scalable, and performant distributed systems
- Understanding of distributed systems w.r.t. application, networking, and security
- SRE or DevOps experience in managing customer-facing systems in a 24/7 environment
- Experience in managing and monitoring fleets of *nix systems or container platforms
- Excellent judgment and integrity with the ability to make timely and sound decisions
- Ability to anticipate the needs of others and adapt to changing conditions
Preferred Qualifications
- Experience as an SCM administrator (e.g. GitHub, or similar)
- Experience with container platforms (e.g. Docker, or similar)
- Experience with monitoring and alerting (e.g. Prometheus, Grafana, or similar)
- Experience with data analysis (e.g. Splunk, or similar)