Day 5: I Built PyTorch's Autograd (And Finally Understood How AI Actually Learns)

  • MyrinNew
    Senior Member
    • Feb 2024
    • 5175

    #1


    From Web3 to ML Research Engineer: Day 5 of 60


    Today was a breakthrough day. After four days of wrestling with linear algebra fundamentals, I finally tackled the mathematical machinery that makes modern AI possible: matrix calculus and automatic differentiation.


    If you've ever wondered how neural networks actually compute gradients for millions of parameters, or why PyTorch's loss.backward() is pure magic, this post is for you.


    The "Aha!" Moment 💡

    It hit me around hour 6 today: automatic differentiation isn't just a neat programming trick—it's the mathematical foundation that makes training GPT-4 computationally feasible.


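    Without AD, training a model with 175 billion parameters would mean deriving gradients by hand or estimating them with finite differences, which costs at least one extra forward pass per parameter for every update. That's not just impractical, it's effectively impossible at that scale.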


    What I Built Today

    1. A Matrix Calculus Calculator

    First, I implemented functions to compute common matrix derivatives:






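    def gradient_quadratic_form(A, x):
        """Compute ∇(x^T A x) = (A + A^T)x"""
        return (A + A.T) @ x

    def gradient_linear_form(A, x):
        """Jacobian of the linear map x -> A^T x (the column form of x^T A), which is A^T.

        x is unused: the derivative of a linear map does not depend on x.
        """
        return A.T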







    Why this matters: Every neural network loss function involves these operations. The mean squared error? Quadratic form. Linear layers? Matrix multiplication derivatives.
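    To make the MSE point concrete, here is a small sketch in plain Python (lists instead of NumPy, just to stay dependency-free). The names mse_loss and mse_grad and the toy data are made up for this example; the formula used is the standard gradient of (1/n)·||Xw − y||², which is (2/n)·Xᵀ(Xw − y).

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def mse_loss(X, w, y):
    """Mean squared error (1/n) * ||X w - y||^2."""
    residual = [p - t for p, t in zip(matvec(X, w), y)]
    return sum(r * r for r in residual) / len(y)

def mse_grad(X, w, y):
    """Analytical gradient (2/n) * X^T (X w - y)."""
    residual = [p - t for p, t in zip(matvec(X, w), y)]
    n = len(y)
    # Column j of X dotted with the residual is entry j of X^T residual
    return [2.0 / n * sum(X[i][j] * residual[i] for i in range(n))
            for j in range(len(w))]

# Toy data (hypothetical): 3 samples, 2 features
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
y = [1.0, 2.0, 3.0]
w = [0.5, -0.5]

grad = mse_grad(X, w, y)

# Central finite differences as an independent check
h = 1e-6
numeric = []
for j in range(len(w)):
    w_plus, w_minus = list(w), list(w)
    w_plus[j] += h
    w_minus[j] -= h
    numeric.append((mse_loss(X, w_plus, y) - mse_loss(X, w_minus, y)) / (2 * h))
```

    The analytical and numerical gradients should agree to several decimal places; the Xᵀ in the formula is the same transpose that shows up in (A + Aᵀ)x above.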


    2. A Mini Automatic Differentiation System

    This was the real challenge. I built a simplified version of PyTorch's autograd:






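    import numpy as np

    class Variable:
        def __init__(self, data, grad_fn=None):
            self.data = data
            self.grad = None
            self.grad_fn = grad_fn  # operation that produced this value; None for leaves
            self.requires_grad = True

        def backward(self):
            """Reverse mode automatic differentiation"""
            # Seed the output gradient with ones: d(out)/d(out) = 1
            if self.grad is None:
                self.grad = np.ones_like(self.data)

            # Hand the gradient to the producing operation, which passes it to its inputs
            if self.grad_fn is not None:
                self.grad_fn.backward(self.grad)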







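    The magic: This small class, paired with a grad_fn object for each operation that knows its own local derivative, can compute gradients for arbitrarily complex compositions of those operations. It's the same principle that powers all modern deep learning frameworks.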
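    For a feel of how the grad_fn side could look, here is a minimal scalar sketch. It is an assumption-heavy toy, not the code from this post and not PyTorch's implementation: AddBackward and MulBackward are hypothetical names, values are plain floats, and gradients start at 0.0 so they can be accumulated.

```python
class Variable:
    def __init__(self, data, grad_fn=None):
        self.data = data
        self.grad = 0.0
        self.grad_fn = grad_fn  # None for leaf variables

    def __add__(self, other):
        return Variable(self.data + other.data, grad_fn=AddBackward(self, other))

    def __mul__(self, other):
        return Variable(self.data * other.data, grad_fn=MulBackward(self, other))

    def backward(self, upstream=1.0):
        # Accumulate, because one variable can feed several operations
        self.grad += upstream
        if self.grad_fn is not None:
            self.grad_fn.backward(upstream)


class AddBackward:
    def __init__(self, a, b):
        self.a, self.b = a, b

    def backward(self, upstream):
        # d(a + b)/da = 1 and d(a + b)/db = 1
        self.a.backward(upstream)
        self.b.backward(upstream)


class MulBackward:
    def __init__(self, a, b):
        self.a, self.b = a, b

    def backward(self, upstream):
        # d(a * b)/da = b and d(a * b)/db = a
        self.a.backward(upstream * self.b.data)
        self.b.backward(upstream * self.a.data)


x = Variable(3.0)
y = Variable(4.0)
z = x * y + x * x   # dz/dx = y + 2x, dz/dy = x
z.backward()

print(x.grad, y.grad)  # 10.0 3.0
```

    This version pushes each path's contribution through the graph separately, which is easy to read but revisits shared subgraphs. Production frameworks instead process nodes once, in reverse topological order.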


    3. A Function Minimizer

    Finally, I put it all together to minimize the famous Rosenbrock function:






    def rosenbrock(x, y):
        """The 'banana function' - notoriously difficult to optimize"""
        return 100 * (y - x**2)**2 + (1 - x)**2







    My optimizer found the minimum at (1, 1) in just 847 iterations. Not bad for a from-scratch implementation!
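    The post doesn't say which optimizer was used, so the following is only a baseline sketch under my own assumptions: plain fixed-step gradient descent with hand-derived partial derivatives, a hypothetical start point of (-1.2, 1.0), and illustrative values for lr and steps. Expect it to need far more than 847 iterations, since fixed-step descent crawls along the curved valley.

```python
def rosenbrock(x, y):
    return 100 * (y - x**2)**2 + (1 - x)**2

def rosenbrock_grad(x, y):
    """Analytical partial derivatives of the Rosenbrock function."""
    dx = -400 * x * (y - x**2) - 2 * (1 - x)
    dy = 200 * (y - x**2)
    return dx, dy

def gradient_descent(x, y, lr=1e-3, steps=50_000):
    """Fixed-step gradient descent; lr and steps are illustrative choices."""
    for _ in range(steps):
        dx, dy = rosenbrock_grad(x, y)
        x, y = x - lr * dx, y - lr * dy
    return x, y

start = (-1.2, 1.0)
x_min, y_min = gradient_descent(*start)
print(x_min, y_min, rosenbrock(x_min, y_min))
```

    The final point should land close to (1, 1). Swapping in momentum or an adaptive method is the usual way to cut the iteration count.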


    The Mathematics Behind the Magic

    Forward Mode vs Reverse Mode

    This was the key insight I gained today:


    Forward Mode AD: Compute derivatives alongside function values
    • Efficient when you have few inputs, many outputs
    • Think: sensitivity analysis for engineering


    Reverse Mode AD: Compute function first, then derivatives backward
    • Efficient when you have many inputs, few outputs
    • Think: neural networks with millions of parameters but one loss value


    Neural networks use reverse mode because we typically have:
    • Millions of parameters (inputs)
    • One loss value (output)
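    To see why forward mode gets expensive with many inputs, here is a small dual-number sketch. Dual, dual_sin, and f are names invented for this illustration; each forward pass carries the derivative with respect to exactly one input, chosen by seeding it with 1.

```python
import math

class Dual:
    """A value paired with its derivative with respect to one chosen input."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def dual_sin(d):
    # Chain rule for sin
    return Dual(math.sin(d.value), math.cos(d.value) * d.deriv)

def f(x, y):
    return x * y + dual_sin(x)

# One forward pass per input: seed that input's derivative with 1
df_dx = f(Dual(2.0, 1.0), Dual(3.0, 0.0)).deriv   # y + cos(x)
df_dy = f(Dual(2.0, 0.0), Dual(3.0, 1.0)).deriv   # x
```

    Two inputs needed two passes. With millions of parameters that becomes millions of passes, while reverse mode gets every partial derivative of a single loss from one backward sweep.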


    The Chain Rule in Matrix Form

    The breakthrough moment was understanding how the chain rule extends to matrices:






    If z = f(y) and y = g(x), then:
    ∂z/∂x = (∂z/∂y) · (∂y/∂x)







    This simple rule, when applied recursively, enables backpropagation through arbitrarily deep networks.
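    A tiny worked case helps here. Assume y = W x and z = sum of y_i² (both chosen purely for illustration). Then ∂z/∂y = 2y and ∂y/∂x = W, so written as a column gradient the chain rule gives ∂z/∂x = Wᵀ(2y). A plain-Python sketch:

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def z_of_x(W, x):
    """z = sum(y_i^2) with y = W x."""
    return sum(y_i * y_i for y_i in matvec(W, x))

W = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # maps 2 inputs to 3 intermediate values
x = [0.5, -2.0]

y = matvec(W, x)                            # inner function
dz_dy = [2.0 * y_i for y_i in y]            # outer derivative, one entry per y_i
dz_dx = matvec(transpose(W), dz_dy)         # chain rule: W^T (dz/dy)

print(dz_dx)  # [-10.0, -19.0]
```

    Note the transpose: the gradient flows backward through W, so the shapes only line up with Wᵀ.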


    Real-World Applications

    Why This Matters for Transformers

    Every attention mechanism in GPT involves:

    1. Matrix multiplications (Q, K, V computations)
    2. Softmax operations (attention weights)
    3. Weighted combinations (attention output)


    Each of these requires matrix calculus to compute gradients efficiently.
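    The softmax step is a nice example. With p = softmax(a), the Jacobian is ∂p_i/∂a_j = p_i(δ_ij − p_j), and the backward pass never needs to build that matrix explicitly. A sketch, with made-up scores and a made-up upstream gradient (softmax_backward is a name chosen for this example):

```python
import math

def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_backward(probs, upstream):
    """Vector-Jacobian product for softmax.

    grad_a[j] = p_j * (upstream[j] - sum_i upstream[i] * p_i)
    """
    dot = sum(g * p for g, p in zip(upstream, probs))
    return [p * (g - dot) for p, g in zip(probs, upstream)]

scores = [2.0, 1.0, 0.1]          # hypothetical attention scores for one query
upstream = [1.0, -0.5, 0.25]      # hypothetical gradient arriving from the next op
probs = softmax(scores)
grad_scores = softmax_backward(probs, upstream)
```

    A quick sanity check: the entries of grad_scores sum to (numerically) zero, because adding a constant to every score leaves the softmax output unchanged.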


    The Computational Revolution

    Before automatic differentiation:
    • Manual gradient computation → Error-prone and slow
    • Finite differences → Approximate, sensitive to step size, and one extra function evaluation per parameter
    • Symbolic differentiation → Expressions can grow exponentially (expression swell)


    After AD:
    • Exact gradients computed efficiently
    • Arbitrary function complexity handled automatically
    • Scalable to billions of parameters
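    The finite-difference problem is easy to reproduce. This sketch estimates the derivative of sin at x = 1 with a one-sided difference for a few step sizes (the step sizes are arbitrary picks for illustration) and records the error against the exact value cos(1):

```python
import math

def forward_difference(f, x, h):
    """One-sided finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x)) / h

x = 1.0
exact = math.cos(x)   # derivative of sin

errors = {}
for h in (1e-1, 1e-5, 1e-8, 1e-15):
    errors[h] = abs(forward_difference(math.sin, x, h) - exact)

for h, err in errors.items():
    print(f"h={h:g}  error={err:.3e}")
```

    Shrinking h helps at first because the truncation error falls, but once h gets close to machine precision the subtraction f(x + h) − f(x) loses almost all of its significant digits and the error climbs again. AD sidesteps this trade-off entirely.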


    The Debugging Journey

    Building AD from scratch taught me how fragile these systems can be:


    Gradient Checking

    I implemented numerical gradient checking to verify my analytical gradients:






    import numpy as np

    def gradient_check(func, x, analytical_grad, h=1e-7):
        """The gold standard for gradient verification (x: 1-D float array)"""
        numerical_grad = np.zeros_like(x)
        for i in range(len(x)):
            # Perturb one coordinate at a time in both directions
            x_plus, x_minus = x.copy(), x.copy()
            x_plus[i] += h
            x_minus[i] -= h
            # Central difference: truncation error shrinks as O(h^2)
            numerical_grad[i] = (func(x_plus) - func(x_minus)) / (2 * h)

        return np.allclose(analytical_grad, numerical_grad, atol=1e-6)







    Common Pitfalls I Discovered

    1. Dimension mismatches in matrix operations
    2. Forgetting to transpose in derivative computations
    3. Accumulating gradients incorrectly in reverse mode
    4. Numerical instability with large step sizes


    Each bug taught me something fundamental about how gradients flow through computational graphs.
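    As an illustration of pitfall 2, here is a sketch of a gradient check catching a forgotten transpose. Everything in it is hypothetical: buggy_grad uses 2·A·x, which is only correct when A is symmetric, and the test matrix is deliberately non-symmetric so the mistake shows up.

```python
def quadratic_form(A, x):
    """f(x) = x^T A x for a square matrix A given as a list of rows."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

def correct_grad(A, x):
    """(A + A^T) x"""
    n = len(x)
    return [sum((A[i][j] + A[j][i]) * x[j] for j in range(n)) for i in range(n)]

def buggy_grad(A, x):
    """2 A x: only right when A is symmetric (the transpose was forgotten)."""
    n = len(x)
    return [sum(2 * A[i][j] * x[j] for j in range(n)) for i in range(n)]

def numerical_grad(f, x, h=1e-6):
    """Central differences, one coordinate at a time."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad

def max_abs_diff(u, v):
    return max(abs(a - b) for a, b in zip(u, v))

A = [[1.0, 2.0], [0.0, 3.0]]   # deliberately non-symmetric
x = [1.0, -1.0]

numeric = numerical_grad(lambda v: quadratic_form(A, v), x)
ok_error = max_abs_diff(correct_grad(A, x), numeric)
bug_error = max_abs_diff(buggy_grad(A, x), numeric)
print(ok_error, bug_error)
```

    The correct gradient matches the numerical one to within rounding, while the buggy one is off by a large margin. Testing only with symmetric matrices would have hidden the bug, so it pays to pick awkward inputs.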


    The Connection to Modern AI

    Why This Enables Large Language Models

    Training a GPT-3-scale model involves:
    • 175 billion parameters to optimize
    • Trillions of operations per training step
    • Exact gradients for each parameter


    Without efficient automatic differentiation, none of this would be possible.


    The Performance Implications

    My simple implementation processes ~1,000 operations per second. PyTorch's highly optimized C++ backend with CUDA acceleration processes millions of operations per second.


    But the mathematical principles are identical.


    Visualizing the Learning Process

    I created visualizations showing:
    • Gradient fields for different functions
    • Convergence paths of the optimizer
    • Loss landscapes in 3D


    The most striking insight: gradient-based optimization is literally following the direction of steepest descent down a mathematical landscape.


    Tomorrow's Challenge

    Day 6 focuses on probability theory and Bayesian inference—the mathematical foundation for:
    • Uncertainty quantification in ML models
    • Bayesian neural networks
    • Variational inference techniques
    • MCMC sampling methods


    Key Takeaways

    1. Automatic differentiation is the unsung hero of modern AI
    2. Matrix calculus is everywhere in deep learning
    3. Reverse mode AD is why neural networks scale
    4. Implementation teaches you the fundamentals better than any textbook
    5. Gradient checking is essential for debugging AD systems


    The Meta-Learning Lesson

    Building these mathematical tools from scratch is giving me something that watching tutorials never could: deep, intuitive understanding of how AI systems actually work.


    When I eventually implement transformers from scratch, I'll understand not just the "what" but the "why" behind every mathematical operation.





    Day 5 Complete: Matrix calculus ✓, Automatic differentiation ✓, Function optimization ✓


    Next up: Probability theory and the mathematical foundations of uncertainty in AI systems.


    The journey from Web3 developer to ML researcher continues. Each day builds on the last, and I'm starting to see how all these mathematical pieces will eventually connect into the complete picture of modern AI.





    What's your experience with automatic differentiation? Have you ever implemented gradient computation from scratch? Drop a comment below—I'd love to hear your insights!





    Follow my 60-day journey from Web3 to ML Research Engineer. Tomorrow: Probability theory and Bayesian inference!



