When interviewing higher-level software engineers, it’s good to assess not just their ability to write code, but also their ability to design systems. To accomplish that, I used to ask people to design something, such as an elevator control system. However, that leaves some gaps. For example, if you ask them to design a distributed storage system, they may decide to use a two-phase commit. But do they remember why? How do simpler schemes fail? Also, when designing with a group of engineers, are they able to give useful feedback to other people?
So instead, I give them naive designs and ask them to list the pros and cons of each. For example:
I want you to help the team design a distributed storage system. I’ll play the role of the team, suggesting designs, and I’d like you to give feedback on the pros and cons of the designs.
Let’s design a distributed file system. Companies like Facebook and Google don’t use centralized storage; instead, they have lots of computers, each with commodity disks, and if one fails, they can continue serving requests without losing data. They have an API that makes the disks look like a single block device with a single, unified address space.
So let’s say the team comes to you with this design:

def write(int address, byte data):
    Store the data on the computer where the write takes place.

def read(int address):
    Look up the address on the computer where the read was issued. If we find it, return the data. Otherwise, broadcast a message to all computers, "please send me the data for this address". If any computer has it, return that data. If no computers have it, return "not found."
I then draw a diagram with four computers, and work through an example where one of the computers writes a block, then another computer reads it.
It’s interesting to see what people list as pros and cons. Many talk about the high cost of broadcasting to all nodes. Few catch the potential data inconsistency. The simplicity is a pro, but lack of redundancy is a con.
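To make the whiteboard example concrete, here is a minimal Python sketch of the naive design with four computers. The names (`Node`, `cluster`, the nodes A through D) are my own for illustration, the "network" is just a list in memory, and the broadcast is modeled as a loop over peers. It first replays the write-then-read case, then shows one way the inconsistency can appear: a second computer writes a different value to the same address.

```python
class Node:
    """One computer with its own local disk (modeled as a dict)."""

    def __init__(self, name, cluster):
        self.name = name
        self.disk = {}
        self.cluster = cluster
        cluster.append(self)

    def write(self, address, data):
        # Naive design: store the data wherever the write was issued.
        self.disk[address] = data

    def read(self, address):
        # Check the local disk first.
        if address in self.disk:
            return self.disk[address]
        # Otherwise "broadcast": ask every other computer, one message per peer.
        for peer in self.cluster:
            if peer is not self and address in peer.disk:
                return peer.disk[address]
        return None  # "not found"


cluster = []
a, b, c, d = (Node(name, cluster) for name in "ABCD")

a.write(7, b"x")
first_read = b.read(7)    # B has no local copy and finds the block on A

c.write(7, b"y")          # a second writer stores a different value on C
read_from_a = a.read(7)   # A answers from its own disk
read_from_c = c.read(7)   # C answers from its own disk
```

After the second write, A and C each answer from their own disk and disagree about address 7, and what any other computer sees depends on which peer answers the broadcast first.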
When they point out that there’s no redundancy, I suggest writing to two machines independently. If both succeed, we return “success”; otherwise, we return “fail.” Can they spot the problems? Can they point out pros, such as that the disks will fill up evenly, or will they focus only on cons?
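One of the problems can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `Disk` class, the `healthy` flag, and the way the two machines are picked are all hypothetical, since the design as stated doesn't say how replicas are chosen. The sketch shows a write where one machine succeeds and the other fails.

```python
class Disk:
    """A machine's disk; `healthy=False` simulates a failed machine."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.blocks = {}

    def store(self, address, data):
        if not self.healthy:
            return False
        self.blocks[address] = data
        return True


def write(replicas, address, data):
    """Write to each machine independently; succeed only if all succeed."""
    # The list comprehension attempts every write, even after a failure.
    results = [disk.store(address, data) for disk in replicas]
    return "success" if all(results) else "fail"


def read(disks, address):
    """Return the block from the first healthy machine that has it."""
    for disk in disks:
        if disk.healthy and address in disk.blocks:
            return disk.blocks[address]
    return None


first = Disk("first")
second = Disk("second", healthy=False)

status = write([first, second], 7, b"new")   # one write lands, one does not
leftover = read([first, second], 7)          # the "failed" data is still readable
```

Here the caller is told the write failed, yet a later read can still return the new data, because nothing undoes the write that did succeed. That gap is the kind of thing that leads toward schemes like two-phase commit.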
I find that asking people to critique designs gives me a better idea of how they understand a system. Do they just throw around buzzwords and boxes, or can they think through the details? Do they have a sense for where race conditions might hide? (Hint: multiple computers touching the same data; things finishing in a non-intuitive order; nodes failing.) Can they actually work through those examples to find the problem? Can they understand that every design has pluses and minuses, or do they only focus on the parts they don’t like?
And I can also use it as a background for asking: Do you facilitate discussions and encourage team members to come up with their ideas? Or do you see generating designs and architectures as your job alone?