Pure-MNIST — a network with nothing to hide

The problem

Frameworks make backpropagation easy to use and easy to never understand. This build was designed so that there is no loss.backward() to hide behind.

Approach

A 784–128–10 network built in NumPy alone, behind a modular layer API: DenseLayer owns the weights and the Y = XW + b forward pass, ReLU zeroes the gradient wherever its input was not positive (the same mask that lets a neuron go dead), and Softmax is fused with cross-entropy for numerical stability. Weights start from He initialization; every gradient is derived by hand from the chain rule and applied through vectorized matrix updates, with no autograd anywhere.
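The pieces described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code. Only the names DenseLayer, ReLU and the fused softmax/cross-entropy idea come from the write-up. The method names (forward, backward, step), the softmax_cross_entropy helper, the seed, the learning rate and the random stand-in batch are all assumptions made for the example.

```python
import numpy as np


class DenseLayer:
    """Fully connected layer computing Y = XW + b."""

    def __init__(self, n_in, n_out, rng):
        # He initialization: zero-mean Gaussian with std sqrt(2 / fan_in).
        self.W = rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)
        self.b = np.zeros(n_out)

    def forward(self, X):
        self.X = X  # cached for the backward pass
        return X @ self.W + self.b

    def backward(self, dY):
        # Chain rule for Y = XW + b:
        #   dL/dW = X^T dY,  dL/db = column sums of dY,  dL/dX = dY W^T
        self.dW = self.X.T @ dY
        self.db = dY.sum(axis=0)
        return dY @ self.W.T

    def step(self, lr):
        # Plain gradient descent update (hypothetical helper name).
        self.W -= lr * self.dW
        self.b -= lr * self.db


class ReLU:
    def forward(self, Z):
        self.mask = Z > 0  # remember which units were active
        return Z * self.mask

    def backward(self, dA):
        # Gradient passes only through units whose input was positive.
        return dA * self.mask


def softmax_cross_entropy(logits, labels):
    """Return (mean loss, dL/dlogits) for integer class labels."""
    # Subtracting the row max keeps exp() from overflowing.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    n = logits.shape[0]
    loss = -log_probs[np.arange(n), labels].mean()
    # Fused gradient: softmax(logits) - one_hot(labels), averaged over the batch.
    grad = np.exp(log_probs)
    grad[np.arange(n), labels] -= 1.0
    return loss, grad / n


rng = np.random.default_rng(0)  # seed chosen only for reproducibility
layers = [DenseLayer(784, 128, rng), ReLU(), DenseLayer(128, 10, rng)]

# Random stand-in batch with MNIST's shapes; real data would replace this.
X = rng.standard_normal((32, 784))
y = rng.integers(0, 10, size=32)


def train_step(X, y, lr=0.02):
    out = X
    for layer in layers:
        out = layer.forward(out)
    loss, grad = softmax_cross_entropy(out, y)
    for layer in reversed(layers):
        grad = layer.backward(grad)
    for layer in layers:
        if isinstance(layer, DenseLayer):
            layer.step(lr)
    return loss


losses = [train_step(X, y) for _ in range(20)]
```

Fusing softmax with the loss is what reduces the output gradient to a single subtraction, so the backward pass never forms the softmax Jacobian.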

Results

[Test accuracy from the training run — and how close it lands to an equivalent PyTorch baseline.]

What broke

[The gradient bugs you found and how you caught them — gradient checking? exploding ReLU? learning-rate cliffs?]