I outlined how communication challenges can lead to problems in IT organizations in my last post. In this follow-up, I want to examine how infrastructure and development teams can work together better.
A colleague of mine sent me this article about how Google manages to balance its Development and Infrastructure teams. Achieving a harmonious coexistence is hard enough on its own, let alone at such a large scale. Here are some key takeaways, along with some advice on how the process of working together can be improved.
1. We’re all friends here: Create a shared goal
While a Sysadmin’s primary goal is uptime, developers want new features, bug fixes, in short: as many releases as they can produce. This is the primary source of friction. Sysadmins are happy to put freezes in place to preserve uptime, and Developers are sometimes trying to get around Sysadmins to get their job done. A first step towards overcoming that friction is a shared vision and goal.
A shared goal and vision are easier said than done. A goal that requires input from both teams is key to creating a shared sense of ownership. What follows is the realization that sometimes you will have to do something that would not normally be your job. As an infrastructure team member, I learnt a little coding and scripting. Not that I would ever be asked to write code, but being able to troubleshoot simple syntax errors won me a lot of respect and let the developers focus on more interesting things like debugging and writing new code. This in turn meant that when I did have an issue that needed fixing, it was clear that I had done all the routine troubleshooting (and even documented it :) ).
The Google approach, called SRE (Site Reliability Engineering) – described as what happens when a software engineer is tasked with what used to be called operations – works because, surprise surprise, when the people who know the software inside out are involved in its design and management, services run better.
Having a common goal and an understanding of how ‘the other guys’ work fosters unity and is a great starting point. This approach requires more technical breadth than is normally expected of members of either team, but the reward for these individuals is clear: more lines on your CV at the very least, better employability at best. And the end result is more robust designs and more reliable services for the business.
2. It worked on my machine!
When new code goes into production, you expect to see errors. You can test as much as you like, but production normally exposes problems you never thought about. And when problems come up, you want to be able to rule out the infrastructure.
It can be expensive, but there is huge value in making sure that developers have access to an environment that is identical, or as similar as possible, to production.
3. Plan to fail. But don't fail to plan
In the end, outages will occur. Systems will fail. Software will crash. You have to accept this fact.
In the Google SRE approach, an ‘error budget’ accommodates this. The goal should be to focus on restoring service first; troubleshooting can come later. And as much as possible, try not to assign blame, even after the event. ‘Blameless post-mortems’ are a good idea: the focus should be on learning how to prevent future outages.
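To make the idea concrete, here is a minimal sketch of how an availability target translates into an error budget. This is my own illustration of standard SLO arithmetic, not Google's exact formula, and the 99.9% target is purely an example:

```python
# Sketch: deriving a monthly error budget from an availability SLO.
# The targets below are illustrative, not recommendations.

def error_budget_minutes(slo: float, period_minutes: int = 30 * 24 * 60) -> float:
    """Minutes of allowed downtime per period for a given availability SLO."""
    return period_minutes * (1 - slo)

print(error_budget_minutes(0.999))   # ~43.2 minutes per 30-day month
print(error_budget_minutes(0.9999))  # ~4.3 minutes per 30-day month
```

Once the budget for the period is spent, the shared goal gives both teams the same answer: slow down releases until reliability recovers.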
One of my pet peeves has always been undecipherable log files. To aid troubleshooting, developers should improve logging as much as possible: when things go wrong, the log files should give whoever is investigating enough information to point them in the right direction. This will no doubt be an iterative process, with improvements made as the system matures. The infrastructure team can make sure there is adequate data about the systems, ideally from monitoring tools, to review after the fact.
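As a hedged illustration of what ‘adequate information’ might look like, here is a small Python sketch using the standard logging module. The function name, fields and simulated failure are my own hypothetical examples; the point is that a log line should say what failed, for which entity, and why:

```python
import logging

logging.basicConfig(format="%(asctime)s %(levelname)s %(name)s %(message)s")
log = logging.getLogger("orders.worker")

def describe_failure(operation: str, entity_id: str, error: Exception) -> str:
    """Build a log message with enough context to triage the problem
    without having to read the source code."""
    return f"{operation} failed entity_id={entity_id} error={type(error).__name__}: {error}"

try:
    # Simulated failure, standing in for a real external call.
    raise TimeoutError("payment gateway timed out after 30s")
except TimeoutError as exc:
    log.error(describe_failure("charge", "ord-42", exc))
```

Compare that with a bare `"charge failed"` in the logs: the first points the on-call engineer straight at the gateway; the second starts a guessing game.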
4. Automate, Automate, Automate!
As much as possible, anything that can be automated should be. If a task is done frequently, is time consuming, is repeatable and requires some level of standardization, it should be a candidate for automation. A good example would be build and/or deploy processes.
A good engineer gets bored quickly, and would rather be working on something strategic or interesting. Time is a limited resource; automation can give the whole team time back to focus on what really matters.
How do you identify a process that can be automated? Start with anything that already has a crib sheet or a checklist. If engineers frequently refer to a document when running a process, then it can probably be automated with some effort, especially if it is easy to show how much time will be saved going forward. You need to make sure that any time invested in development will be recouped – the table below (from xkcd no less) gives a good guide.
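The arithmetic behind that table can be sketched in a few lines. This assumes the five-year horizon the comic uses; the numbers in the example are hypothetical:

```python
# Sketch of the "is it worth the time?" arithmetic:
# over a fixed horizon, total time saved = number of runs x minutes shaved per run.

def break_even_hours(runs_per_week: float, minutes_saved_per_run: float,
                     horizon_years: float = 5.0) -> float:
    """Maximum hours worth investing in automation before it stops paying off."""
    total_runs = runs_per_week * 52 * horizon_years
    return total_runs * minutes_saved_per_run / 60

# A weekly deploy that a script would shave 30 minutes off:
print(break_even_hours(1, 30))  # prints 130.0 (hours, over five years)
```

In other words, automating that weekly deploy is worthwhile as long as building the script takes less than roughly 130 hours – which it almost certainly will.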
In summary, it is very important to get everyone on the same page. Synergy is key: with the right attitude, processes and leadership, getting IT teams to work together leads to more stable environments and the ability to deploy software faster. In the end, the business benefits when all the cogs work together.