Sync NI meets Marty Bell, Chief Software Architect at Rakuten’s Blockchain Lab, to talk all things DevOps and some of the challenges and opportunities of developing in the Cloud.
DevOps is a set of practices, tools, and a cultural philosophy that automate and integrate the processes between software development and IT teams. What does ‘Cultural Philosophy’ mean to you personally and Rakuten in general?
For context, from the very first days of the Rakuten Blockchain Lab we took the decision to host our services on public cloud rather than in Rakuten's private data centres. That forced us to embrace the philosophy of integrating our development and operations and taking full ownership of both. This applied not only to the dev and test environments, but also to our production environments.
The core values and principles of the Rakuten Group are embodied in Rakuten Shugi, which includes the Five Principles for Success. One of those is "Shikumika", which roughly translates to "systemise". For DevOps that means we want to codify everything we do, creating capabilities that are repeatable and self-service for the engineering teams. As much as possible we don't want to have tasks that only DevOps can undertake, as that adds friction and manual steps that can only slow down the engineering teams. The DevOps Vision Statement in Rakuten is "Be Relentless in Automation for Frictionless Engineering."
DevOps engineers typically juggle different tasks like coding, integrating, and testing, not to mention security and compliance. What do you consider to be the most important skill sets to be a successful DevOps engineer?
Almost anything you would say about a software engineer I would say applies to a DevOps engineer. An engineer's craft (broadly) is to design, build and test software to an agreed requirement, and DevOps is no different. Certainly on the DevOps side I've seen the design part of that lifecycle get skipped, going straight from a request to implementation. That can seem straightforward enough. However, while the resulting script or package may work, it's usually not that reusable or maintainable and is just creating an issue for another day.
Circling back on skillsets, you do need to be able to manage many spinning plates and context switch at the drop of a hat. The role is multi-faceted in that you have roadmap activities that are planned and then you have to react to supporting engineers or looking into issues in the toolchain or environments. Our goal is to spend more time on planned work and less on fire-fighting.
Another trait you see in good DevOps engineers (and I'm stealing this line from someone else) is that they are lazy in a good way! That is to say they hate repetition, and will naturally optimise and automate the things they do rather than get stuck doing rinse-and-repeat activities manually.
Finally, I would say that a DevOps engineer needs to be considered in their approach and trusted by the engineering teams. That's because they have elevated privileges to environments and toolchains, which means there is always the risk that they could do a lot of damage if something goes wrong. A cool head and a steady hand are essential. We have to trust they know what they're doing!
What would you consider the most impactful developments in DevOps in recent years?
The most impactful thing I can think of in this space would be the Self Service model created by the Cloud Providers and more specifically a programmable Cloud with APIs that has completely transformed how we provision and manage the infrastructure and runtimes for our applications. This in turn has fostered the plethora of infrastructure as code solutions that are available to DevOps to make provisioning infrastructure repeatable, consistent and managed in the same way we release code. Now we can cut a release that contains both code and infrastructure changes all tagged in a git repo.
On the application side, containerisation has led to teams being able to package their runtime dependencies with their applications and, using Kubernetes, deploy those applications at scale. This is just not possible without a strong DevOps capability to support it. We also now have DevSecOps, where we are integrating vulnerability scanning and security compliance into our pipelines. For example, we can security scan our Docker images within our CI/CD pipelines and schedule regular scanning and analysis of our environments for vulnerabilities. This will generate insights and lead to action plans.
Running on Cloud can have its challenges, what challenges have you experienced and how have you resolved them?
The most common issue we run into while running on cloud is that the network is essentially unreliable. When you are running a microservice-based architecture with lots of service-to-service interactions, intermittent failures can have a real impact. We have spent a lot of time implementing retry logic across our service interactions, as these network errors tend to be transient and very short lived. More generally, designing for failure is key to running on cloud.
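To make the retry idea concrete, here is a minimal sketch of retrying a call with exponential backoff and jitter. It is an illustration only, not the Lab's actual implementation: `TransientError`, `call_with_retry`, `make_flaky` and the default attempt count and delays are all assumed names and values.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, resets)."""


def call_with_retry(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying transient failures with backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Back off 0.1s, 0.2s, 0.4s, ... plus random jitter so that
            # many callers do not all retry at the same moment.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))


def make_flaky(failures):
    """Build a fake service call that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TransientError("simulated network blip")
        return "ok"

    return operation, state
```

Calling `call_with_retry(operation)` with a call that fails twice and then succeeds returns the successful result on the third attempt. Only errors classed as transient are retried; anything else propagates immediately.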
When we implemented retry logic we quickly realised this is only possible if the services are idempotent, because a failure doesn't necessarily mean a request failed; in some cases, like a timeout, we just don't know. Without idempotency, where services will only ever perform a given action once, you run the risk that a retry performs the action twice, which can be much worse than failing. We also introduced the concept of housekeepers that monitor certain critical processes on the platform and take action if they identify a failure scenario occurring.
The rate of change of cloud services is also an important factor to consider. While you try not to get sticky to your cloud, it's difficult to justify rolling your own solutions for services that come pre-canned by your cloud provider. We use many different Azure services, some of which have a high rate of change in their capabilities and supported versions. For example, the Azure Kubernetes Service (AKS) will generally increment major versions twice a year, so keeping within a supported version can be time-consuming. For those services (like AKS) we try to isolate them from a provisioning and runtime perspective so that we can build out replacement infrastructure and services alongside the existing platform and swap out old for new instances without impacting the overall platform or incurring downtime. It’s all about seeing the shear points in your architecture and isolating them.
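One common way to make an action safe to retry is to have the caller send a unique key with each request and have the service remember the result for that key. The sketch below shows that pattern under stated assumptions: `LedgerService`, `credit` and the key format are hypothetical, and a real service would store the keys durably rather than in memory.

```python
class LedgerService:
    """Hypothetical in-memory service; real systems would persist the keys."""

    def __init__(self):
        self.balance = 0
        self._results = {}  # idempotency key -> result already returned

    def credit(self, idempotency_key, amount):
        # A repeated key means this is a retry of a request we have already
        # handled: return the stored result instead of crediting again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.balance += amount
        result = {"status": "applied", "balance": self.balance}
        self._results[idempotency_key] = result
        return result


service = LedgerService()
first = service.credit("req-123", 50)
retry = service.credit("req-123", 50)  # e.g. the first response timed out
```

Here the retry returns the same result as the first call and the balance is credited only once, so a caller that saw a timeout can safely send the request again.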
What advice would you give to anyone considering a career in Cloud / DevOps?
If you are considering this as a career then obviously a foundation in good engineering practices will stand you in good stead, and if you like learning new things you won't be disappointed: the rate of change and innovation on the Cloud platforms is head melting at times.
Looking at the market, there are not enough good DevOps engineers out there to meet demand, and a good DevOps engineer is worth their weight in gold, so it is a really good place to be right now.