Site Reliability Engineer

Technology and Digital

United Arab Emirates
Posted 3 months ago

About the job

We are seeking a skilled Data Site Reliability Engineer (SRE) who has experience with data platforms to join a dynamic international company. 


The ideal candidate will be responsible for ensuring the reliability, scalability, and performance of our systems and applications. As an SRE, you will collaborate closely with development and operations teams to design, implement, and maintain robust infrastructure solutions. You will also be involved in monitoring, troubleshooting, and optimizing our systems to meet the demands of our rapidly growing business. The role sits in the Cloud Engineering team, where you will develop and maintain cloud-native technology:


● Highly scalable Kubernetes clusters

● Cloud access management automation and integration with Kubernetes (k8s)


As a Data Platform Site Reliability Engineer, you will manage infrastructure and applications on cloud computing platforms to deliver data processing, governance, and storage.


As an SRE, you’ll need to solve problems that arise using empirical data, teamwork, and your own unique expertise.


The Data Platform SRE will work directly with our data platform and engineering teams in an embedded SRE model, operating in unison with the developers to deliver seamless experiences for our customers.


Responsibilities:

  • Design, implement, and maintain scalable and reliable data infrastructure solutions for storing, processing, and analyzing large volumes of data.
  • Collaborate with data engineering and data science teams to define and implement operational requirements for data pipelines, ETL processes, and analytical workflows.
  • Automate deployment, configuration, and monitoring of data systems and services to ensure efficient and reliable operation.
  • Develop and maintain monitoring and alerting systems to proactively identify and address issues with data availability, quality, and performance.
  • Troubleshoot and resolve data-related issues in a timely manner, minimizing impact on downstream applications and users.
  • Implement data governance and security best practices to ensure the confidentiality, integrity, and availability of our data assets.
  • Perform capacity planning and performance tuning to optimize the performance and cost-effectiveness of our data infrastructure.
  • Participate in on-call rotations and respond to data-related incidents outside of regular business hours when necessary.
  • Evaluate and adopt new data technologies and tools to improve the efficiency, reliability, and scalability of our data infrastructure.
  • Document system designs, configurations, and operational procedures to facilitate knowledge sharing and collaboration.


Qualifications:

● Strong sense of ownership and integrity demonstrated through clear communication and collaboration

● Experience in architecting, developing, operating, and troubleshooting Kubernetes clusters and/or other highly available systems at scale

● Proficiency with the architecture, deployment, performance tuning, and troubleshooting of open-source data analytics technologies, especially Apache Spark, Trino, and related software in a large-scale environment

● The ability to design, author, and release code in languages like Go, Python, or Java

● Acute drive to automate manual operations and to improve them through repeated iteration

● Understanding of the Linux operating system, standard networking protocols, and components

● Experience with cloud-native services on AWS/GCP

● Hands-on experience managing large numbers of diverse systems with configuration management or software delivery platforms (such as Terraform, CloudFormation, Argo CD, and Flux)

● Experience with deploying, supporting, and monitoring new and existing services, platforms, and application stacks

● Excellent troubleshooting and problem-solving skills

● Experience with scale testing, disaster recovery, and capacity planning

● Effective communication and collaboration skills, with the ability to drive and promote technical partnerships across teams

● Incident response and/or incident management experience


This is a long-term remote contract role. Candidates in the surrounding regions are preferred because of the time zone.


If the above matches your skill set, please apply.
