With the question of what DevOps is all about out of the way, let’s consider the challenge of setting up an effective DevOps team. I’ll try to present this as a series of questions, which in order hopefully comprise a good overview of the process. I certainly can’t cover everything in s single post, so let’s start with the basics:
Do I even need a DevOps team?
Not necessarily. What you do most certainly need, and probably already have, is an application engineer role. The formal makeup of such a team derives from business necessity: one-man ISVs handle all aspects of the application on their own, including the operational aspects, and are therefore not interesting for the purpose of this discussion. But even with garage startups that comprise very few people (as low as three, in practice), someone always ends up taking the DevOps mantle; there’s always the one guy who ends up taking care of the hosting, deployment and all that other business because someone has to. As your organization grows, as your engineering throughput exceeds your operational capacity and the complexity of your product portfolio increases dramatically, that’s when you need to consider an actual team.
How do I split up the work?
There is no absolute truth here; the answer depends on the size, organizational structure and corporate culture of your specific organization. Your requirements may be satisfied by having just one übergeek take care of everything; in other cases (such as multiple long-term projects, very high system complexity or simply too much work for one person) you may need to divide the work up into more specific agendas. There are numerous strategies you can employ, but here’s a few ideas to get you started:
1. Division by domain
This is a fairly trivial model: figure out what work gets done, categorize it into a reasonably small number of domains, and assign them to specific team members according to their respective skillsets, interests and aspirations. To illustrate, consider the following domains: capacity planning, monitoring, high availability, infrastructure (networking, storage) and deployment. Your engineers probably already handle all of the above.
While this model has the benefit of generally placing the most qualified personnel in charge of any particular domain, be careful to have a least a pair of engineers on each domain so you can scale up when necessary, and to ensure knowledge is retained. In case of small teams (3-5 people) this means each person is likely to have more than a single area of responsibility; depending on your team members, the ensuing overhead (or even personality clash) may have a significant impact on the efficiency of your team.
2. Division by feature
Another fairly simple strategy is to have shared ownership of all the fundamental tools and processes your team employs, and scale by assigning specific features to team members. Not everyone on the team has to have the same level of expertise on every domain, but every team member does need to have a good grasp of all aspects of your work. The level of expertise of any specific team member becomes an evaluation criterion when assigning a new feature, project or issue, and has to be weighed alongside other criteria, such as the need to challenge less-experienced team members, pressing schedules and the other myriad details that managers have to handle. Ideally, any feature can be assigned to any team member and executed effectively.
The primary drawback to this model is efficiency: it is irresponsible to expect the same level of productivity for a specific task from each team member, which makes efficiency the direct responsibility of whomever divides up the assignments, usually the team lead or group manager†. Unfortunately, not all managers are apt at identifying bottlenecks and shifting responsibility around to optimize for efficiency; whether this model suits your organization or not depends on your ability to effectively track your team members’ abilities and provide effective guidance.
A possibly bigger challenge here is knowledge sharing. As the team grows, so does the communication overhead and, perhaps more importantly, the opportunity for miscommunication, necessitating meticulous documentation and proactive knowledge management. The ensuing overhead can kill productivity if not carefully managed.
† Small-enough teams may be able to split the work up by consensus, but success depends on the social composition of the team and is hard to predict.
3. Divide and conquer
The problem can also be attacked from a completely different angle. Your team comprises members with different skills, mindsets and even ambitions; sometimes these do not mesh together naturally, possibly due to too much overlap or, in other cases, too little. This can happen for any number of reasons, but let’s assume that this is not a case of simple ego clash or, as Rands put it, a toxic asset; your team members are hard workers, they like each other well enough and want to get things done, but they just can’t agree on the mechanics. It’s time to divide and conquer.
There are two ways to go about doing this: the simplest is to split up domain ownership between the groups. It’s not likely that anyone will be truly satisfied with your choices (if the division of expertise or interest is natural, there probably wouldn’t be any significant friction in the first place), but the grumbles will eventually subside, if only through relief that the constant bickering is over. When the division of responsibilities is clear, people feel there’s less at stake and are more likely to concede a draw in an argument (don’t expect capitulation here, these are engineers after all…). Once the emotional involvement is out of the equation, engineers are quite rational; if they can’t make a convincing argument and they’re not personally vested in the result, the argument ends with an amicable “yeah, alright, it’s up to you.” Unfortunately, turning this to your advantage in the long-term entails a clever assignment of domains, balanced well enough so that both sides rate their level of interest above the “meh” threshold. How to do that is entirely up to you.
The second way is to introduce a new element into the mix. In particular, if the areas of contention typically resolve around choice of tools, it might be wise to assign ownership of tool selection to a third party; for instance, friction may ensue over the choice of monitoring system. You may experience heated arguments about the relative merits of e.g. Nagios and Zabbix, whereas a level-headed analysis would show that a customized system based on Graphite is your best strategy. Or perhaps the debate is on httpd vs nginx; or any strategic technical decision, really. See what I’m getting at here? At a previous employer we ended up defining an Infrastructure Architect role, an operations counterpart to the System Architect role in R&D. These rolls fill very much the same need, except on opposite ends of the spectrum; whereas the SA might pick Hadoop and log4j for data storage and logging respectively, the IA will likely pick Graphite and LogStash to monitor the cluster and aggregate logs, respectively. Interestingly, the necessary skillset is also quite similar. Identifying and mentoring likely candidates is a subject that I’ll likely tackle in a future installment.
To be continued
There’s much more to say; this is an entirely new branch of the industry, with few acknowledged best practices and even fewer well-known truths. The main point here is that you have to experiment in order to build an effective team, and manage it efficiently. I’d love to hear actual feedback on what works and what doesn’t, and likewise I’d be interested in feedback regarding what I should focus on next: which is more interesting, recruiting techniques and pains, or building an effective interface with the other groups in the organization? Or perhaps an entirely different domain, such as knowledge management techniques? Fire away…