You can do it in a week yourself, why wait for 2018?
This is what I would suggest: as basic as it gets, yet an actual NN, and an extensively studied problem.
input: "x" vector of 784 floats. This is where you input the image pixel intensities.
hidden layer: 256 units. It has a weight matrix W1 of size 784x256 and a bias vector b1 of length 256. These are variables you need to optimize. Treating x as a row vector (so the shapes line up), the output of the hidden layer units is
h = max(x*W1 + b1, 0)
max is the easiest "activation function" from a gradient perspective, works well in practice, and is called ReLU. So it's a linear transform followed by a non-linear transform. Very easy: matmul, add, max.
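If it helps to see it concretely, here's a minimal numpy sketch of that hidden layer (the 0.01 init scale and the random stand-in image are just placeholder choices):

    import numpy as np

    rng = np.random.default_rng(0)

    # The variables to optimize: small random init for W1, zeros for b1.
    W1 = rng.normal(0, 0.01, size=(784, 256))
    b1 = np.zeros(256)

    x = rng.random(784)             # stand-in for one image's pixel intensities

    # Linear transform (x as a row vector, so x @ W1) followed by ReLU.
    h = np.maximum(x @ W1 + b1, 0)
    print(h.shape)                  # (256,)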
the output of the NN is going to be 10 class probabilities. What digit do we see in the image? Similar to the hidden layer, you have a linear transform followed by a non-linear transform:
y = sm(h*W2 + b2)
W2 is a 256x10 matrix, b2 a vector of length 10. You also optimize these.
sm(z) is the softmax function.
https://en.wikipedia.org/wiki/Softmax_function It transforms the vector of 10 floats into 10 probabilities (all in the range 0-1, and the sum == 1).
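In code, a numerically stable softmax and the output layer could look like this (continuing the sketch above):

    def sm(z):
        # Softmax; subtracting max(z) avoids overflow without changing the result.
        e = np.exp(z - z.max())
        return e / e.sum()

    W2 = rng.normal(0, 0.01, size=(256, 10))
    b2 = np.zeros(10)

    y = sm(h @ W2 + b2)
    print(y.sum())                  # 1.0 -- ten probabilities summing to one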
...that's the "forward pass": it specifies how to convert inputs to outputs.
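Put together, the whole forward pass is a few lines (a sketch; h is returned as well because the backward pass below reuses it):

    def forward(x, W1, b1, W2, b2):
        # Inputs -> hidden activations -> class probabilities.
        h = np.maximum(x @ W1 + b1, 0)
        y = sm(h @ W2 + b2)
        return h, y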
for the backward pass you need to define a loss function. The image was e.g. a "9" but the NN said p=[0.1, 0.05, .. 0.03]. How do you convert that mismatch to a number? The standard choice is cross-entropy: take -log of the probability the NN assigned to the correct class. You can then differentiate that loss function with respect to W1, b1, W2, b2 and get your gradients.
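For this particular net the gradients are short enough to derive by hand. A sketch with the same shapes and names as above (worth double-checking against the equations rather than taking on faith):

    def loss_and_grads(x, label, W1, b1, W2, b2):
        h, y = forward(x, W1, b1, W2, b2)
        loss = -np.log(y[label])        # cross-entropy: -log p(correct class)

        # Softmax + cross-entropy combine into the famously simple gradient
        # y - onehot(label) at the output logits.
        dz2 = y.copy()
        dz2[label] -= 1.0
        dW2 = np.outer(h, dz2)          # same shape as W2: 256x10
        db2 = dz2

        # Backprop into the hidden layer; ReLU passes gradient only where h > 0.
        dh = dz2 @ W2.T
        dz1 = dh * (h > 0)
        dW1 = np.outer(x, dz1)          # same shape as W1: 784x256
        db1 = dz1
        return loss, dW1, db1, dW2, db2

A vanilla SGD step is then just W1 -= lr * dW1 (and likewise for b1, W2, b2), looped over the training images.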
Do you need pointers with equations for the backward pass?