Hi there 👋

Engineering topics ranging from Kubernetes and databases to AI, virality, and problem solving. Welcome!

RAS 2026: A Comparative Analysis of Cloud Providers for AI Workloads

Introduction As artificial intelligence continues to reshape industries, the underlying infrastructure powering these innovations is more critical than ever. For AI workloads, especially large-scale training and inference, the trifecta of Reliability, Availability, and Serviceability (RAS) is paramount. These three pillars determine the robustness, uptime, and maintainability of a system, directly impacting the performance and cost-effectiveness of AI applications. In this post, we’ll explore the RAS landscape in 2026, comparing two distinct categories of cloud providers: the specialized clouds (Crusoe, Nebius, CoreWeave) and the hyperscalers (AWS, GCP, Azure)....

Debugging Slow Nodes in an InfiniBand Network

In the world of High-Performance Computing (HPC), an InfiniBand network is the backbone that ties together numerous nodes to work in concert. The performance of the entire cluster is predicated on each node and connection operating at its peak. However, a single underperforming node can create a bottleneck, drastically slowing down the entire fabric. This post will explore a systematic approach to identifying, isolating, and repairing these slow nodes using a binary search algorithm and NCCL (NVIDIA Collective Communications Library) benchmarks....
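The binary search idea can be sketched roughly as follows. This is a minimal illustration, not the post's actual tooling: it assumes exactly one slow node, and that any group containing that node benchmarks slow. The names `find_slow_node`, `group_is_slow`, and the fake benchmark are hypothetical; in practice `group_is_slow` would launch an NCCL benchmark across the given nodes and compare the measured bandwidth to a threshold.

```python
from typing import Callable, List, Sequence


def find_slow_node(
    nodes: Sequence[str],
    group_is_slow: Callable[[Sequence[str]], bool],
) -> str:
    """Bisect `nodes` down to a single suspect.

    Assumption (hypothetical setup): exactly one node is slow, and any
    group that contains it is reported as slow by `group_is_slow`.
    """
    if not nodes:
        raise ValueError("nodes must not be empty")
    candidates: List[str] = list(nodes)
    while len(candidates) > 1:
        mid = len(candidates) // 2
        left, right = candidates[:mid], candidates[mid:]
        # Benchmark only the left half; if it is healthy, the slow
        # node must be in the right half (single-slow-node assumption).
        candidates = left if group_is_slow(left) else right
    return candidates[0]


# Hypothetical stand-in for a real benchmark run across a node group.
SLOW_NODE = "node-13"
benchmark_runs = 0


def fake_group_is_slow(group: Sequence[str]) -> bool:
    global benchmark_runs
    benchmark_runs += 1
    return SLOW_NODE in group


nodes = [f"node-{i:02d}" for i in range(16)]
culprit = find_slow_node(nodes, fake_group_is_slow)
print(culprit, benchmark_runs)
```

With 16 nodes the search needs 4 benchmark runs instead of 16 individual checks. If more than one node may be slow, both halves would need to be benchmarked at each step.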

The Steps for Gemini Client to Upload This Blog Post From a Single Prompt

In the world of content creation and management, automation is key. This very blog post is a testament to the power of AI and automation, as it was generated and published by Google’s Gemini in response to a single prompt. Here’s a look at the steps involved in this process: 1. The Prompt It all started with a simple instruction: “in blog create a new hugo markdown file matching the title of the article with a dash between words....

Unlocking Simplicity: Embedding Executables in Go for Robust Deployments

As a Director of Engineering, I’m always looking for ways to simplify our development and deployment processes. The more we can reduce external dependencies and streamline our tooling, the more reliable and maintainable our systems become. One of the most elegant solutions I’ve seen for this challenge comes from the Go ecosystem: embedding executables directly inside a Go binary. This technique is a game-changer for building self-contained applications. Imagine deploying a single, static binary that contains not only your core application but also all the necessary command-line tools or other executables it depends on....

Mapping Datacenter With Events

Problem In a datacenter the goal is to understand everything about the datacenter without actually having to be in the datacenter. How is this achieved? Engineers’ View of Hardware Most engineers don’t care where servers are physically placed, unless they are on-call for hardware issues and are asked to determine why a class of servers is reporting errors, such as temperature errors. Knowing how servers are laid out helps with the RCA....

Harnessing Rituals for Uninterrupted Uptime

Introduction In the ever-evolving landscape of site reliability engineering (SRE), ensuring the highest levels of uptime is the ultimate goal. Behind the scenes, a symphony of processes and practices comes together to achieve this, and at the heart of it all are the rituals that we, as a large SRE team, follow with unwavering dedication. In this blog post, we’ll dive deep into the rituals that shape our approach to maintaining uptime, shedding light on how these practices contribute to a resilient system and an empowered team....

Focus by Answering What Problem Are We Solving

Problem One of the most effective ways I have found to focus is to answer the question, “What problem are we solving?” For this post, the problem that I am solving is how to communicate this idea to readers so they too can focus on what to solve. When there are 100 things to fix, what do you fix? One of the better articles that I read about focusing on a problem https://jasonfeifer....

Engineering Methods Based on Fictional Characters

Problem What are effective strategies for an engineer to communicate with their peers and answer their inquiries efficiently? Solution Appeal to your fellow engineers through shared cultural touchstones, like the well-known characters from science fiction that most engineers are familiar with. Case: Time Estimation in Software Engineering Every software engineer inevitably has to provide an estimate for when a code block, task, or project will be completed. Defining what “done” means can be complex....

Postmortem of a 2005 Flickr Outage Modernized for Today

Review of Flickr Architecture Before we delve into the postmortem, let’s review the details of the database architecture and Flickr’s overall architecture at that time. Flickr was then a PHP shop, running PHP 4 and using Smarty as its template engine, on Apache with mod_php. PHP was everywhere because it’s a great engine. I am sure that many hate this statement and will say that PHP is X or PHP is Y, but in reality it enabled Flickr to jump from a top 100 destination in the USA to a top 5 destination in less than 2 years....

Using Playwright With Golang and Python to Grab JS-Generated CSV

Problem I am helping a friend process reports that sit behind a paywall and store them in a database to create a churn detector, which will enable his crew to address the churn at hand. Details What is churn? Churn is a term used by companies to denote that a customer no longer wants to use the services they previously paid for. Each company has different metrics to give an indication that a customer is CHURNING....