Convolutional neural networks are multilayer perceptrons with a special structure.
CNNs have repetitive blocks of neurons that are applied across space (for images) or time (for audio signals, etc.).
For images, these blocks of neurons can be interpreted as 2D convolutional kernels, repeatedly applied over each patch of the image.
For speech, they can be seen as 1D convolutional kernels applied across time windows.
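To make the "shared kernel slid over every patch" picture concrete, here is a minimal NumPy sketch of the 2D case (the function name, the toy 6x6 image, and the little edge-detector kernel are illustrative choices, not anything prescribed above):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Apply one shared 2D kernel to every patch of a single-channel image
    ('valid' mode, stride 1, no padding; written as cross-correlation, which
    is what most deep-learning libraries actually compute)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # same weights at every location
    return out

# A toy 6x6 image that is dark on the left and bright on the right,
# filtered with a small vertical-edge detector.
image = np.zeros((6, 6))
image[:, 3:] = 1.0
kernel = np.array([[1.0, -1.0],
                   [1.0, -1.0]])
print(conv2d_valid(image, kernel))  # responds only in the output column that straddles the edge
```

The 1D case for audio is the same loop with a one-dimensional kernel slid across time windows instead of image patches.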
At training time, the weights of these repeated blocks are 'shared', i.e. the weight gradients computed over the various image patches are accumulated (in effect, averaged) into a single update for the shared kernel, as the sketch below illustrates.
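Here is a small, self-contained sketch of what that sharing means for the gradients, using a similarly tiny setup (a single 2x2 kernel and a trivial loss that just sums the outputs; all names and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((4, 4))   # toy 4x4 single-channel input
kernel = rng.standard_normal((2, 2))  # one shared 2x2 kernel

# Forward pass: the same kernel is applied at every 2x2 patch.
out = np.empty((3, 3))
for i in range(3):
    for j in range(3):
        out[i, j] = np.sum(image[i:i + 2, j:j + 2] * kernel)

# Trivial loss: the sum of all outputs, so d(loss)/d(out[i, j]) = 1 everywhere.
loss = out.sum()

# Because the kernel is shared, its gradient accumulates the per-position
# gradients (here, each one is simply the corresponding input patch).
grad_kernel = np.zeros_like(kernel)
for i in range(3):
    for j in range(3):
        grad_kernel += 1.0 * image[i:i + 2, j:j + 2]

# Sanity check with a finite difference on one shared weight.
eps = 1e-6
kernel[0, 0] += eps
loss_pert = sum(np.sum(image[i:i + 2, j:j + 2] * kernel)
                for i in range(3) for j in range(3))
print(grad_kernel[0, 0], (loss_pert - loss) / eps)  # the two agree
```

A single training step therefore updates the one shared kernel with information pooled from every location; frameworks sum these per-position terms, and dividing by the number of positions gives the "averaged" view described above.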
The reason for choosing this special structure is to exploit spatial or temporal invariance in recognition. For instance, a "dog" or a "car" may appear anywhere in the image.
If we were to learn independent weights at each spatial or temporal location, it would take orders of magnitude more training data to train such an MLP.
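A back-of-the-envelope parameter count gives a sense of the scale; the 224x224 input size and the single 3x3 detector are assumptions picked purely for illustration:

```python
# One 3x3 feature detector on a 224x224 single-channel image (biases ignored).
H = W = 224
k = 3

# Shared (convolutional) weights: one kernel reused at every location.
shared_params = k * k                                # 9

# Hypothetical untied version: an independent 3x3 weight patch at each of the
# (H - k + 1) * (W - k + 1) output locations.
untied_params = (H - k + 1) * (W - k + 1) * k * k    # 443,556

print(shared_params, untied_params)  # ~49,000x more weights without sharing
```

And each of those untied patches would only ever receive training signal from objects that happen to land on its own location, which is the data-hunger problem described next.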
In over-simplified terms, for an MLP that did not repeat weights across space, the group of neurons receiving input from the lower-left corner of the image would have to learn to represent "dog" independently from the group of neurons connected to the upper-left corner, and correspondingly we would need enough images of dogs that the network sees several examples at each possible image location separately.

On top of this fundamental constraint, many special techniques and layers have been invented specifically in the CNN context.
So the latest deep CNNs might look very different from a bare-bones MLP, but the above is the difference in principle.