Site Reliability Engineer

BMC is looking for a Site Reliability Engineer to join our SaaS production and
delivery team to operate and assure SaaS service availability. The production
team serves as the center of BMC's advanced SaaS platform and provides 24×7
fast response to ensure maximal service availability and performance.

As part of the SRE team, you will focus on production ownership, introducing
new technologies and systems, and pushing our production excellence and
offering to the next level. Along with modern monitoring solutions such as
Datadog, you will use current technologies and services such as K8S, MSK, SQS,
and various databases.

In this role you will:

* Own operational responsibility for the application and platform layers under a critical SLA.
* Oversee and own overall production – deployments, maintenance, and enhancements.
* Ensure high availability of the production SaaS platform, working with various teams.
* Manage production and serve as its technical focal point.
* Improve our systems, deployments, operations, and overall cloud activities.
* Take responsibility for deployment, tools, troubleshooting, and performance tuning.
* Manage incidents, guiding our 24×7 SRE team to understand their importance.
* Focus on root cause analysis, prevention measures, and knowledge transfer.
* Develop and maintain processes, documentation, and automation.
* Support platform maintenance and testing initiatives.

What we are looking for:

* BSc in Engineering or equivalent experience.
* At least 2 years’ experience running solutions in production (e.g., technical lead, L3 support).
* Hands-on approach – strong troubleshooting and problem-solving skills.
* Experience with a cloud provider (AWS preferred) – a must.
* Experience with Docker and K8S.
* Practical experience with production incident management principles.
* Understanding of CI/CD concepts and tools (e.g., Jenkins, Bitbucket, GitLab).
* UNIX/Linux experience and system administration knowledge.
* Experience with major monitoring solutions (e.g., Datadog, logz.io, Prometheus).
* Ability to write and maintain technical documents and standard operating procedures.
* Strong verbal and written communication skills in English and Hebrew.
* Team player with a get-stuff-done attitude and self-learning skills.

It would be an advantage for you to have:

* Development background and automation scripting experience.
* Experience with Terraform, Ansible, and Helm.
* Understanding of security and networking of production environments.

Job number: 8810
