Site Reliability Engineer – DevOps – Infrastructure
One of our top clients in the Human Capital Management space is looking for an SRE with infrastructure automation experience to join their team.
Our client offers the vibe of a startup with the resources of a large company, and you’ll have plenty of opportunities to help drive and validate product strategy.
The Technology Stack
- Architecture: API-driven, distributed, service-oriented
- Frontend: React/Redux, Swagger + Backbone/Handlebars/jQuery
- Backend: Java 8, Spring Boot, Solr, Kafka, Zuul
- Build: Jenkins, Docker, Artifactory, Gradle
- Infrastructure: AWS, Consul, Nomad, Terraform, Vault, Salt, MySQL/RDS
- Quality Assurance: Mabl, Locust/BlazeMeter, Python-based API test automation
Things you’ll do
- Work with product engineering teams, architects, and the application platform team to build services that scale and perform in a distributed microservice architecture.
- Create metrics that feed our observability systems and dashboards, giving developers and operations teams insight into how services behave.
- Develop internal tools and help solve problems up and down the stack.
- Define alerting thresholds, help troubleshoot application issues, and help lead postmortems.
- Build out infrastructure to support these apps, using Terraform to spin up AWS environments or creating Docker containers to deploy into our schedulers.
- Try to break stuff! Our apps must be battle-tested and able to withstand different failure scenarios. That means load and stress testing, chaos engineering experiments, and anything else that makes sure our customers never deal with downtime or failure. We thrive at 99.99% availability.
- Participate in the on-call rotation alongside developers and be available to work with our client/customer support teams if necessary.
What we’re looking for
- Experience designing, deploying, and managing large cloud infrastructures on platforms such as AWS, GCE, or Azure.
- Solid understanding of managing cloud native application systems.
- Excellent oral and written communication. Ability to convey ideas internally to co-workers as well as externally through meetups and talks.
- Ability to understand distributed software architectures and troubleshoot them from infrastructure through application layers.
- Experience with containers and an understanding of how they work internally. Experience deploying into a production environment with a scheduler such as Nomad, Kubernetes, Mesos, or ECS is a plus.
- Experience implementing a service discovery system for dynamic environments using tools such as Consul, SmartStack, or etcd.
- Ability to write code and scripts in languages such as Python, Go, or Ruby.
- Passion for technology and desire to push our tech stack forward.
- A team player who works closely with developers and operations.
- Experience with monitoring, instrumentation, and performance engineering.
Nice to have
- Experience in a continuous delivery/deployment environment and with its supporting tools.
- Experience with API/SPA architectures.
- Knowledge of service mesh technologies such as Envoy or Linkerd.
- Experience with Java and JVM in a production environment.
- Ability to configure and customize monitoring tools (Prometheus, Grafana, New Relic, Graphite, etc.).