A Day in the Life at CME

  • Fabrice Scoupe, Site Reliability Engineer, shares the challenges and rewards of his role, along with the skills needed to succeed

    Q. With over 20 years as a software engineer, what was your motivation to transition to site reliability engineering (SRE)?

    My main motivation was curiosity: after over two decades of working mostly in product development, I wanted to learn more about product operations. With the growth of virtualisation, containerisation and cloud computing, I had been more and more exposed to tools and activities that may traditionally have been considered part of operations (such as Terraform or Kubernetes). Cloud computing encourages engineers to manage infrastructure and configuration as code, and blurs the line between “devs” and “ops” - everybody ends up using similar tools and processes. SREs also tend to look at the system as a whole, whereas developers will focus in more depth on particular parts. SREs get to understand the product more broadly, and “in the wild” (i.e. production), whereas devs would work in earlier stages such as dev or QA environments. I like seeing the bigger picture, and I enjoy troubleshooting complex issues, so SRE was an attractive role. It is all about managing risk: neither too little, which stifles innovation; nor too much, which causes outages.

    Q. What are the core responsibilities and functions of your role and how do these differ from traditional DevOps?

    There are subtle differences between the two. SREs focus on reliability as perceived by the external or internal users of their product. Making sure the product behaves in production as expected by the customers is one of their core responsibilities, whereas DevOps is more about removing the “wall” between development and operations to improve and accelerate the software development life cycle.

    Q. As systems increasingly demand greater availability, performance and capacity, to what extent can automation solve these problems?

    A service expected to be available 99.99% of the time can only afford to be down for a total of about 4 minutes every month, so automation is necessary to resolve issues within that budget. The scale and complexity of modern services are also unmanageable without it, and Artificial Intelligence/Machine Learning certainly helps with performance analysis and improvement, as well as capacity planning.
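
    The "about 4 minutes" figure follows from simple arithmetic. The short Python sketch below is an illustration only: the function name and the 30-day month are assumptions made for the example, not something taken from the interview.

```python
def downtime_budget_minutes(availability_percent: float, period_days: float = 30.0) -> float:
    """Return the downtime, in minutes, that an availability target allows.

    Hypothetical helper for illustration; a 30-day month is assumed by default.
    """
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60  # minutes in the period
    return total_minutes * (100.0 - availability_percent) / 100.0


if __name__ == "__main__":
    # Compare a few common availability targets over a 30-day month.
    for target in (99.0, 99.9, 99.99):
        budget = downtime_budget_minutes(target)
        print(f"{target}% availability allows {budget:.2f} minutes of downtime")
```

    Under these assumptions, a 99.99% target leaves roughly 4.3 minutes per month, which is little time for a person to be paged, log in and diagnose a problem by hand.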

    Q. Is it possible to have a ‘normal day’ in your role or is every day a different challenge?

    SREs split their time between operations and projects. Operations will typically involve some relatively regular, routine tasks and unexpected incidents that need to be investigated and mitigated, while projects are similar to development and involve planning, design, testing, etc. This mix gives a good balance between different kinds of problem-solving (fast troubleshooting vs longer-term system design), high-focus tasks and more relaxed activities. Dealing with incidents can be exciting - and stressful - so it is good to also get time for longer-term, slower-paced work such as imagining and building new functionality.

    Q. What would you consider the key attributes that make an excellent site reliability engineer?

    Good troubleshooting skills are essential since SREs are expected to triage and mitigate incidents quickly. SREs also need to be able to solve problems with code, although not necessarily to the same extent as software engineers. They need to be more pragmatic and hands-on than dogmatic: incidents need to be mitigated immediately, so a more long-term solution may have to wait until later. They should be good at improvising and comfortable with uncertainty and ambiguity, yet have strong attention to detail. Good knowledge of computer systems is very useful too. For more senior SREs, system design skills are also important. And as with all other types of engineers, curiosity, willingness to learn and communication skills matter a great deal.

    Q. Is there a traditional or preferred pathway into site reliability engineering as a career?

    There is no preferred pathway to becoming an SRE. Recruits tend to come from two broad types of background: those with systems/hardware engineering and system administration experience, who also have good coding/scripting skills, and those with software development experience, who also have a good understanding of the systems (networks, storage, OS, etc). An ideal SRE team would mix people from both backgrounds, as they complement and can learn from each other.

    This article appears in the skills, education and tech careers edition of Sync NI magazine. To receive a free copy click here.
