I did a virtual onsite interview with Google for the Systems SRE role in 2022. This post is about my preparation for the interview and my actual interview experience.

In my previous post, I discussed the phone screen I had for the SRE role. After the phone screen, my recruiter got in touch with me, discussed the results of the interview and handed me over to another recruiter who handled the onsite interviews. I had a call with this new recruiter and we discussed the timeline and the topics I needed to prepare for. Compared to the software engineer roles at Google, there aren’t a lot of resources available online to help with preparation for the Google systems SRE role. I found these posts to be really useful:

  1. Dan Luedtke’s blogpost on interviewing for the SRE manager role
  2. Fabrizio Waldner’s post
  3. Oğuz Yarımtepe’s post

The Systems SRE interviews have the following components:

Practical Coding and Scripting:

I prefer coding in Python and I communicated this to the recruiter. For the Systems SRE role, you do not have to worry about learning complex algorithms or data structures like trees and graphs. They expect you to write code with an emphasis on resource utilization. I read up on Python’s memory management principles and practiced some easy and medium coding questions on LeetCode.

The Blind 75 list is a big help. I would also recommend practicing by recreating common systems functions like file traversal, log processing, or file parsing. Again, writing memory-efficient code is key.

During my interview, I was asked to write a system function using Python. I used the file object as an iterator to avoid loading the entire file into memory, but the interviewer felt that I could use the sliding window technique to read the contents of the file.
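The actual interview question is not reproduced here, so the following is only a minimal sketch of how the two ideas can be combined: iterate over the file object lazily, and keep a fixed-size sliding window of recent lines in a `collections.deque`. The function names, the `"ERROR"` marker, and the window size are hypothetical and chosen purely for illustration.

```python
from collections import deque
from typing import Iterable, Iterator, Tuple


def sliding_windows(lines: Iterable[str], size: int) -> Iterator[Tuple[str, ...]]:
    """Yield every run of `size` consecutive lines.

    Only `size` lines are held in memory at any time, regardless of
    how large the input is.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    window = deque(maxlen=size)  # the oldest line is dropped automatically
    for line in lines:
        window.append(line.rstrip("\n"))
        if len(window) == size:
            yield tuple(window)


def max_errors_in_window(lines: Iterable[str], size: int) -> int:
    """Return the largest number of 'ERROR' lines seen in any window.

    'ERROR' is a hypothetical marker used for this example.
    """
    best = 0
    for window in sliding_windows(lines, size):
        best = max(best, sum("ERROR" in entry for entry in window))
    return best
```

Because a file object is itself an iterator over lines, it can be passed in directly, for example `with open(path) as fh: max_errors_in_window(fh, 100)`, and the file is never loaded into memory as a whole.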

Non-abstract large scale systems design:

This one is unique to the Google SRE role. In most SRE interviews at other companies, you are asked to design a system: you talk about the design scope, give a high-level design, and then do a deep dive into the design and its components, including a few back-of-the-envelope calculations, for example to estimate the amount of storage required for a service. With the Google SRE NALSD interview, you go deep! You design a system that is scalable, and you are expected to provide a detailed estimate of the amount of storage and traffic you expect your web service to experience. You also talk about the number of servers that you need to meet those storage and traffic demands. Dan Luedtke perfectly summarizes this when he mentions that you have to come up with a Bill of Materials (BOM) for the design that you are proposing.
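To make the idea of a BOM a little more concrete, here is a rough sketch of the kind of arithmetic involved. Every number, field name, and the simple model itself (storage and network egress as the only two constraints) are assumptions made up for this example; they are not taken from the interview or from any Google material.

```python
from dataclasses import dataclass


def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up, so a partial server counts as a whole one."""
    return -(-a // b)


@dataclass(frozen=True)
class Workload:
    # All fields are hypothetical inputs you would agree on with the interviewer.
    uploads_per_day: int
    avg_object_bytes: int
    retention_days: int
    replicas: int
    peak_reads_per_second: int


@dataclass(frozen=True)
class Server:
    # Hypothetical per-machine capacities.
    usable_disk_bytes: int
    nic_bits_per_second: int


def bill_of_materials(w: Workload, s: Server) -> dict:
    """Estimate storage, peak egress, and the server count each one implies."""
    storage_bytes = (
        w.uploads_per_day * w.avg_object_bytes * w.retention_days * w.replicas
    )
    egress_bits_per_second = w.peak_reads_per_second * w.avg_object_bytes * 8
    by_storage = ceil_div(storage_bytes, s.usable_disk_bytes)
    by_network = ceil_div(egress_bits_per_second, s.nic_bits_per_second)
    return {
        "storage_bytes": storage_bytes,
        "egress_bits_per_second": egress_bits_per_second,
        "servers_for_storage": by_storage,
        "servers_for_network": by_network,
        # The tighter constraint decides how many machines are needed.
        "servers": max(by_storage, by_network),
    }


if __name__ == "__main__":
    workload = Workload(
        uploads_per_day=1_000_000,
        avg_object_bytes=2_000_000,   # 2 MB per object
        retention_days=365,
        replicas=3,
        peak_reads_per_second=50_000,
    )
    server = Server(
        usable_disk_bytes=40 * 10**12,    # 40 TB
        nic_bits_per_second=10 * 10**9,   # 10 Gbit/s
    )
    for key, value in bill_of_materials(workload, server).items():
        print(f"{key}: {value}")
```

With these made-up inputs, storage alone calls for 55 servers, but serving the peak read traffic over 10 Gbit/s NICs calls for 80, so the network is the limiting resource. A real estimate would also add headroom for failures and growth, which this sketch leaves out.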

For preparation, I used the following for basics:

  1. System Design Primer
  2. System Design Interview – An insider’s guide by Alex Xu

There are also multiple YouTube videos dedicated to the basic components of a modern production system, scalability, and reliability. In addition to these resources and specifically for the Google NALSD interview, I recommend the following resources:

  1. SRE Classroom - Go through both the exercises and the video lessons.
  2. The slides for the presentation on the Distributed Image Server in the SRE classroom. I found this presentation to be very useful in that it showed me a sample of the BOM calculations.
  3. Reading lists like this

I felt like I did a good job with the estimations for storage, traffic, and number of servers, but I made some mistakes with my system design by adding a component that was a single point of failure. As with any other skill, practice is the key to acing this interview. Take your time, understand the basics, learn to make quick back-of-the-envelope estimations, and make sure to think about all of the failure modes of the system you are designing.

Linux internals:

I learnt the hard way that, however much experience I had accumulated troubleshooting Linux systems during my career, there were still a considerable number of Linux kernel concepts that I had no clue about. Robert Love (who also works for Google) wrote the book “Linux Kernel Development”, which helps a lot with preparation for this interview. I also recommend Michael Kerrisk’s “The Linux Programming Interface” book to brush up on the concepts. The basics of a Linux system are how processes are handled, how memory is allocated, how file I/O works, how the system maintains time, how signals work, how pipes work, and how sockets work. Dive deep into each of these areas and learn how the kernel handles them to help with this interview.
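One way to study these areas is to poke at the system calls directly. As a small illustration, not something taken from the interview, the sketch below uses Python's thin wrappers around `pipe`, `fork`, and `waitpid` to send bytes from a child process to its parent. The function name is hypothetical, and the code assumes a Unix-like system.

```python
import os


def echo_through_pipe(message: bytes) -> bytes:
    """Fork a child that writes `message` into a pipe; the parent reads it back."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()

    if pid == 0:
        # Child process: it only writes, so it closes the read end first.
        exit_code = 1
        try:
            os.close(read_fd)
            with os.fdopen(write_fd, "wb") as writer:
                writer.write(message)
            exit_code = 0
        finally:
            # Leave immediately without running the parent's cleanup handlers.
            os._exit(exit_code)

    # Parent process: close its copy of the write end, otherwise the read
    # below would never see end-of-file.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as reader:
        data = reader.read()  # returns once every write end has been closed

    # Reap the child so it does not linger as a zombie.
    _, status = os.waitpid(pid, 0)
    if not (os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0):
        raise RuntimeError("child process did not exit cleanly")
    return data
```

Commenting out the `os.close(write_fd)` call in the parent makes the read hang forever, which is a handy way to see why end-of-file on a pipe depends on every writer closing its descriptor.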

Linux Troubleshooting:

The meat and potatoes of any SRE’s work! My mentor at my previous job recommended Brendan Gregg’s Linux Performance Analysis in 60,000 Milliseconds blog post when I first joined the team. I haven’t found a better resource on the steps to follow to troubleshoot a system. However, like all other skills, you get better at troubleshooting with practice. With experience and a few production issues, you learn the finer points of troubleshooting. You learn not to shave the yak and not to go on a wild goose chase based on metrics that are not relevant to the issue. In my interview, I made the mistake of not checking the application logs, which could have helped me identify the issue sooner. I only realized this towards the end, after exhausting a whole range of other scenarios.

Behavioral Interview:

For this interview, I gathered a list of possible behavioral questions from YouTube. I then listed all of the projects I had worked on in my professional career and used them as examples to answer each of the questions using the STAR method. For the list of questions, this blogpost by Molly Struve is one of the resources I used.

Closing Thoughts:

I have been in awe of the SRE team at Google ever since I read the Google SRE book. This was my third attempt at interviewing with Google and the first time I made it to the onsite rounds. I did not clear the interview. In my opinion, I did not do well in the systems internals interview, and I made a couple of mistakes in both my Linux troubleshooting and NALSD interviews.

The experience was great. The recruiters and schedulers are very friendly, and the interviewers are very professional and proud of their work. The period of preparation and interviews was a good learning experience. I hope to get another shot at the interview, and next time I hope to prepare better, learn from the mistakes I made, and clear the interviews.