“My utilization is too high! My latency is increasing, I’ve got too many cache misses, my search results aren’t relevant, I’m way above the threshold and all my alarms are going off, especially in the middle of the night.”
Ever feel this way as an engineering manager – that your compute and storage resources aren’t cutting it, but you haven’t figured out how to turn on auto-scaling? I feel it, and I hear it from so many of my fellow eng managers.
I realized that a few distributed systems analogies might help me think about this…
- Front end, user interface: meetings, meetings, meetings
- API layer: email, Slack, text, DMs, code review, Google Docs comments
- Database/storage layer: paper notebooks, Evernote, Gmail, Google Docs, spreadsheets, iOS Notes
- Compute layer: multitasking/multithreading, background processes; some processes need the whole CPU, and some cannot be multithreaded (see Daniel Kahneman’s book Thinking, Fast and Slow for more on System 1 and System 2 thinking)
- Network: packet loss through unread messages, incomplete todo items, missed deadlines, poor memory
- Firewall: executive assistant arranging and declining meetings, deflecting unnecessary requests and email
- Cache: LRU caching: recently used items are fast to access, while older items get evicted and have to be fetched from storage (see Database/storage layer)
- Latency: length of time it takes to respond to email, Slack, text, DMs, code review, Google Docs comments (and hey, we probably need to define some SLAs here)
- Availability: Time in the office = 24/7/365 – (PTO + nights/weekends + training); PTO = system maintenance, training = system upgrade.
- DDoS attacks: relentless recruiters, sales people
- Search: complexity increases with the number of Database/storage layer systems employed
- Monitoring, observability: health checkups, heart rate monitoring, blood tests
- Power systems: sleep, food, exercise
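For anyone who wants the cache analogy spelled out, here is a minimal Python sketch of LRU behavior. Everything in it is made up for illustration: the `LRUCache` class, the `dig_through_notes` helper standing in for the storage layer, the topics, and the capacity of two (my actual working memory may be smaller).

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache: keeps recently used items, evicts the least recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # oldest entry first, newest entry last

    def get(self, key, fetch_from_storage):
        if key in self.items:
            self.items.move_to_end(key)  # mark as most recently used
            return self.items[key], "hit"
        value = fetch_from_storage(key)  # slow path: go to the storage layer
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used entry
        return value, "miss"


def dig_through_notes(topic):
    # Hypothetical stand-in for searching notebooks, docs, and email.
    return f"notes on {topic}"


brain = LRUCache(capacity=2)
topics = ["roadmap", "hiring", "roadmap", "budget", "hiring"]
results = [brain.get(topic, dig_through_notes)[1] for topic in topics]
print(results)  # ['miss', 'miss', 'hit', 'miss', 'miss']
```

Only the second "roadmap" lookup is a hit. By the time "hiring" comes up again, "budget" has pushed it out of the cache, which is roughly what it feels like when someone asks about a thread from last week.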
I’m sure my fellow EMs and engineers can think of lots more parallels. So how might these concepts help me design a more robust management “service”? What do I need to deprecate? Where do I need more resiliency? What do I need to upgrade? Can I do some performance tuning? Hm, maybe what I need is an embedded management SRE!