blog_

Hello 👋,
I am an engineer at heart (about me), focused on thoughtful creation with an emphasis on throughput, availability, and correctness.

You can trace my mind dump here:

INTRO: How to Build Your Own Version of NumPy

Python is expressive and flexible but slow for heavy numeric computation. Libraries like NumPy combine Python with fast compiled backends to speed up array and matrix operations. With Pybind11, you can write C++ or CUDA code, expose it to Python, and package it cross-platform. In this post, we’ll cover building vectorized CPU functions with SIMD and OpenMP, GPU acceleration, compiling .so modules, distributing via wheels, and understanding the Python ABI. The examples double as notes for anyone curious about how CuPy, a GPU-powered NumPy alternative, works under the hood. ...
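To make the packaging part concrete, here is a minimal setup.py sketch using pybind11's setuptools helpers (the module name fastmod, the source path, and the compiler flags are placeholder assumptions, and the flags shown assume GCC/Clang):

# setup.py - minimal sketch; "fastmod" and src/fastmod.cpp are hypothetical names
from setuptools import setup
from pybind11.setup_helpers import Pybind11Extension, build_ext

ext_modules = [
    Pybind11Extension(
        "fastmod",                     # import name of the compiled .so extension
        ["src/fastmod.cpp"],           # C++ sources exposed to Python via pybind11
        cxx_std=17,
        extra_compile_args=["-O3", "-march=native", "-fopenmp"],  # auto-vectorization + OpenMP (GCC/Clang)
        extra_link_args=["-fopenmp"],
    ),
]

setup(
    name="fastmod",
    version="0.0.1",
    ext_modules=ext_modules,
    cmdclass={"build_ext": build_ext},  # pybind11's helper picks sensible per-platform defaults
)

From here, python -m build produces a platform-specific wheel, and a tool like cibuildwheel can repeat that build across operating systems and Python versions; the post covers the wheel and ABI details.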

January 21, 2026 · 10 min · 1977 words · Me

NOTES: Memory of Forgotten

Love, CPU Memory Access, FreeBSD Project - https://people.freebsd.org/~lstewart/articles/cpumemory.pdf
HeMem: Scalable Tiered Memory Management for Big Data Applications and Real NVM, SOSP 2021 - https://doi.org/10.1145/3477132.3483550 (Technion)
Assise: Performance and Availability via Client-local NVM in a Distributed File System, OSDI 2020 - https://www.usenix.org/system/files/osdi20-anderson.pdf (USENIX)
Don’t Be a Blockhead: Zoned Namespaces Make Work on Conventional SSDs Obsolete, HotOS 2021 - https://doi.org/10.1145/3458336.3465300 (ACM SIGOPS) ...

December 21, 2025 · 8 min · 1562 words · Me

NOTES: training at distributed scale?

Key definitions
Data parallelism: every GPU (worker) holds a full copy of the model and processes a different slice of the training data; after computing gradients, they are combined (synchronised).
Model parallelism: the model itself is split across devices (layers or parameters), so different GPUs handle different parts of the model.
Parameter / optimizer state / gradient: model parameters = weights; gradients = derivatives computed in back-prop; optimizer state = extra per-parameter data (momentum, Adam’s moments).
Collective communication / all-reduce: multiple devices coordinate a data exchange, e.g. an all-reduce sums a tensor across all devices and then distributes the result back to each of them.
Sharding: dividing a tensor or state among workers so each stores only a part (instead of full replication).
Offloading: moving data (parameters, states) to slower but larger memory (CPU RAM or SSD/NVMe) when it isn’t actively needed on the GPU.
1. PyTorch Distributed Data Parallel (DDP)
Training large deep-neural-network models efficiently across many GPUs is the core problem; DDP addresses it by replicating the model on each GPU, processing different samples in parallel, and then synchronising gradients via collective operations, achieving high scalability.
The main insight is efficient overlap of computation and communication: it is implemented through gradient “bucketing” (grouping gradient tensors) and overlapping gradient reduction with the backward pass, and enabled by the process-group abstraction (in torch.distributed) and collective backends such as NCCL.
The second takeaway is the process-group abstraction that hides the details of inter-GPU communication: implemented via torch.distributed.ProcessGroup and enabled by NCCL- or MPI-based libraries that manage all-reduce/broadcast under the hood.
The third insight is reducing redundant copies of model parameters and buffers across processes: rather than each process fully copying every buffer, DDP uses shared storage for model parameters in multi-process mode (via torch.multiprocessing + shared CUDA memory), so intra-node clones reuse memory rather than duplicating it.
One last interesting fact: DDP achieves near-linear scalability (up to ~256 GPUs) when configured properly (bucketing, overlapping), as shown in the PyTorch paper. (arXiv) ...
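For reference, a minimal DDP setup sketch matching the notes above (it assumes a single node launched with torchrun; the toy Linear model, sizes, and learning rate are placeholders):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR/PORT in the environment
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    # bucket_cap_mb sets the gradient-bucket size used to overlap all-reduce with backward
    ddp_model = DDP(model, device_ids=[local_rank], bucket_cap_mb=25)
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    x = torch.randn(32, 1024, device=f"cuda:{local_rank}")  # each rank sees a different data slice
    loss = ddp_model(x).sum()
    loss.backward()   # gradient all-reduce is overlapped with the backward pass, bucket by bucket
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launch with something like torchrun --nproc_per_node=8 train.py; each process becomes one worker in the default process group.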

October 21, 2025 · 8 min · 1618 words · Me

NOTES: vSpiders or how to virtualize a network

Cloud networking is about squeezing determinism from chaos - hundreds of thousands of tenants, each expecting a private, secure, and high-performance network that doesn’t even “feel” shared. This post takes you from the ground up: we’ll start with what virtualization means in networking, explain overlay networks, fabric topologies, and control planes, and then we’ll walk through three real systems:
(1) Koponen et al., Network Virtualization in Multi-Tenant Datacenters, NSDI 2014 - PDF from USENIX
(2) Dalton et al., Andromeda: Performance, Isolation, and Velocity at Scale, NSDI 2018 - PDF from USENIX
(3) Snap: a Microkernel Approach to Host Networking, SOSP 2019 - official web page (includes PDF) at Google Research ...

October 10, 2025 · 8 min · 1605 words · Me

NOTES: GPU Architecture, TVM, and Triton

If TensorFlow showed us how to scale ML across entire data centers, and PyTorch 2.0 showed us how to compile dynamic Python code, GPUs and compiler stacks like TVM and Triton show us how to squeeze the absolute last drop of performance out of hardware. This post is my attempt to tie together GPU architecture, TVM’s tensor-level compilation, and Triton’s clever tiling JIT strategy, all in one nerdy, slightly dorky package. ...
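To make the tiling idea concrete, here is a minimal Triton sketch (the classic vector-add kernel; the 1024-element block size and tensor shapes are arbitrary choices): each program instance owns one BLOCK_SIZE-wide tile of the problem.

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                            # which tile this instance owns
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)  # element indices inside the tile
    mask = offsets < n_elements                            # guard the ragged last tile
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)   # one program instance per 1024-element tile
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)

Triton JIT-compiles the kernel per block-size and dtype configuration, which is the tiling-plus-JIT strategy the post unpacks.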

October 6, 2025 · 5 min · 971 words · Me