Yosinski, Clune, Bengio, Lipson - «How transferable are features in deep neural networks?» (2014)

dmitrii_fediuk · May 21, 2019, 11:14am

Some quotations from the paper:

Many deep neural networks trained on natural images exhibit a curious phenomenon in common: on the first layer they learn features similar to Gabor filters and color blobs.
Such first-layer features appear not to be specific to a particular dataset or task, but general in that they are applicable to many datasets and tasks.

This phenomenon occurs not only for different datasets, but even with very different training objectives, including supervised image classification, unsupervised density learning, and unsupervised learning of sparse representations.

Because finding these standard features on the first layer seems to occur regardless of the exact cost function and natural image dataset, we call these first-layer features general.
On the other hand, we know that the features computed by the last layer of a trained network must depend greatly on the chosen dataset and task. We thus call the last-layer features specific.
If first-layer features are general and last-layer features are specific, then there must be a transition from general to specific somewhere in the network.

This observation raises a few questions:

Can we quantify the degree to which a particular layer is general or specific?

Does the transition occur suddenly at a single layer, or is it spread out over several layers?

Where does this transition take place: near the first, middle, or last layer of the network?

We are interested in the answers to these questions because, to the extent that features within a network are general, we will be able to use them for transfer learning.
In transfer learning, we first train a base network on a base dataset and task, and then we repurpose the learned features, or transfer them, to a second target network to be trained on a target dataset and task.
This process will tend to work if the features are general, meaning suitable to both base and target tasks, instead of specific to the base task.

When the target dataset is significantly smaller than the base dataset, transfer learning can be a powerful tool to enable training a large target network without overfitting.

The usual transfer learning approach is to train a base network and then copy its first n layers to the first n layers of a target network.
The remaining layers of the target network are then randomly initialized and trained toward the target task.
One can choose to backpropagate the errors from the new task into the base (copied) features to fine-tune them to the new task, or the transferred feature layers can be left frozen, meaning that they do not change during training on the new task.
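
To make the procedure above concrete, here is a minimal sketch in plain Python (not the authors' code). It assumes a toy representation in which a network is a list of dense layers, each a dict with a `weights` matrix and a `trainable` flag; the names `init_layer`, `build_network` and `transfer` are made up for illustration. It also assumes that the target network uses the same layer sizes as the base network for the first n layers, so the copied weights fit.

```python
import copy
import random


def init_layer(n_in, n_out, rng):
    """Randomly initialize one dense layer (hypothetical toy representation)."""
    weights = [[rng.gauss(0.0, 0.01) for _ in range(n_out)] for _ in range(n_in)]
    return {"weights": weights, "trainable": True}


def build_network(layer_sizes, rng):
    """Build a list of layers from sizes such as [8, 6, 4, 3]."""
    return [init_layer(a, b, rng) for a, b in zip(layer_sizes, layer_sizes[1:])]


def transfer(base_net, layer_sizes, n, fine_tune, rng):
    """Create a target network whose first n layers are copied from base_net.

    The remaining layers keep their random initialization. If fine_tune is
    False, the copied layers are marked frozen (trainable=False), so a
    training loop that respects the flag would not update them.
    Assumes layer_sizes matches the base network for the first n layers.
    """
    if not 0 <= n <= len(base_net):
        raise ValueError("n must be between 0 and the number of base layers")
    target = build_network(layer_sizes, rng)
    for i in range(n):
        target[i] = copy.deepcopy(base_net[i])  # copy, do not share, the weights
        target[i]["trainable"] = fine_tune
    return target


rng = random.Random(0)
sizes = [8, 6, 4, 3]
base = build_network(sizes, rng)  # stands in for a network trained on the base task

frozen_target = transfer(base, sizes, n=2, fine_tune=False, rng=rng)
tuned_target = transfer(base, sizes, n=2, fine_tune=True, rng=rng)
```

In a real framework the `trainable` flag would correspond to whatever mechanism excludes parameters from gradient updates; the sketch only shows the bookkeeping, not the training itself.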

The choice of whether or not to fine-tune the first n layers of the target network depends on the size of the target dataset and the number of parameters in the first n layers.

If the target dataset is small and the number of parameters is large, fine-tuning may result in overfitting, so the features are often left frozen.

On the other hand, if the target dataset is large or the number of parameters is small, so that overfitting is not a problem, then the base features can be fine-tuned to the new task to improve performance.

Of course, if the target dataset is very large, there would be little need to transfer because the lower level filters could just be learned from scratch on the target dataset.
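
The reasoning in the last few quotes can be written down as a rough rule of thumb. The sketch below is only an illustration: the function name `choose_strategy` and both thresholds (`max_params_per_example`, `scratch_threshold`) are hypothetical values picked for the example, not numbers from the paper.

```python
def choose_strategy(n_target_examples, n_copied_params,
                    max_params_per_example=10.0,
                    scratch_threshold=1_000_000):
    """Pick a transfer strategy from dataset size and parameter count.

    Both thresholds are illustrative assumptions; in practice they would
    have to be tuned for the task at hand.
    """
    if n_target_examples <= 0:
        raise ValueError("n_target_examples must be positive")
    # Very large target dataset: the filters could be learned from scratch.
    if n_target_examples >= scratch_threshold:
        return "train_from_scratch"
    # Small dataset relative to the copied parameters: fine-tuning may overfit.
    if n_copied_params / n_target_examples > max_params_per_example:
        return "freeze"
    # Otherwise overfitting is less of a concern, so fine-tune the copied layers.
    return "fine_tune"


small_data = choose_strategy(1_000, 1_000_000)
medium_data = choose_strategy(100_000, 500_000)
huge_data = choose_strategy(2_000_000, 500_000)
```

With these made-up thresholds, the three calls cover the three cases described in the quotes: a small dataset with many copied parameters, a dataset large enough to fine-tune, and a dataset large enough to skip transfer.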