One last piece of the puzzle is understanding how weight filters operate as pattern matchers, which is how convolution filters are often described.
Why is it that the more the values of f(x) around a point resemble the filter weights (recall that f(x) here means the pixel values, or the values from a lower network layer), the higher the value of the convolution at that point?
The simple answer is that discrete convolution is equivalent to taking a dot product between the filter weights and the values underneath the filter, and, geometrically, dot products measure vector similarity.
Output-centered convolution works by taking a vector of weights and a vector of input values, multiplying the aligned entries, and summing the products: this is exactly the calculation of a dot product.
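To make this concrete, here is a minimal NumPy sketch (the image and filter values are made up for illustration) that computes one output value both ways: by multiplying and summing the aligned entries, and by flattening the filter and the patch beneath it into vectors and taking their dot product.

```python
# A minimal sketch, assuming the convention used by most deep learning
# libraries (no filter flip): one output value of the convolution is exactly
# the dot product of the filter weights with the values underneath the filter.
import numpy as np

image = np.array([
    [1.0, 2.0, 0.0, 1.0],
    [0.0, 1.0, 3.0, 1.0],
    [2.0, 0.0, 1.0, 2.0],
    [1.0, 1.0, 0.0, 0.0],
])

filt = np.array([
    [ 1.0,  0.0, -1.0],
    [ 1.0,  0.0, -1.0],
    [ 1.0,  0.0, -1.0],
])  # a simple vertical-edge filter

# The output value centered at row 1, column 1: the 3x3 patch whose
# top-left corner sits at (0, 0).
patch = image[0:3, 0:3]

# Multiply aligned entries and sum them ...
conv_value = np.sum(patch * filt)

# ... which is the same as flattening both into vectors and taking a dot product.
dot_value = np.dot(patch.ravel(), filt.ravel())

print(conv_value, dot_value)
assert np.isclose(conv_value, dot_value)
```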
It’s certainly true that you can increase a dot product by increasing the magnitude of one or both vectors, but for a fixed magnitude the dot product is maximal when the vectors point in the same direction, or, in our case, when the intensity pattern in the pixel values matches the pattern of high and low weights in the filter.
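The sketch below illustrates that geometric point with made-up values: the filter and several patches are normalized to unit length, so magnitude is held fixed, and the dot product comes out largest for the patch whose bright-to-dark pattern lines up with the filter's high and low weights.

```python
# A minimal sketch, assuming NumPy and invented patch values: with all vectors
# scaled to unit length, the dot product (the filter response) is largest for
# the patch whose intensity pattern matches the filter.
import numpy as np

def unit(v):
    """Flatten a patch or filter and scale it to unit length."""
    v = v.ravel().astype(float)
    return v / np.linalg.norm(v)

filt = np.array([
    [ 1.0, 0.0, -1.0],
    [ 1.0, 0.0, -1.0],
    [ 1.0, 0.0, -1.0],
])  # high weights on the left, low weights on the right

matching_patch = np.array([
    [9.0, 5.0, 1.0],
    [9.0, 5.0, 1.0],
    [9.0, 5.0, 1.0],
])  # bright on the left, dark on the right: same pattern as the filter

flat_patch = np.full((3, 3), 5.0)          # no structure at all
opposite_patch = matching_patch[:, ::-1]   # the edge points the other way

for name, patch in [("matching", matching_patch),
                    ("flat", flat_patch),
                    ("opposite", opposite_patch)]:
    print(name, np.dot(unit(filt), unit(patch)))

# matching  -> roughly  0.55 (the highest response)
# flat      ->          0.0
# opposite  -> roughly -0.55 (the pattern anti-matches the filter)
```

The matching patch does not reach a dot product of 1 only because its center column is not zero, as the filter's is; among patches of the same magnitude, it is still the one the filter responds to most strongly.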