What is «Big Data»?

As more and more of our activities take place on computers, more and more of what we do is recorded.

As our computers are increasingly networked together, it becomes easier to centralize these records and curate them into a dataset appropriate for machine learning applications.

The age of “Big Data” has made machine learning much easier because the key burden of statistical estimation—generalizing well to new data after observing only a small amount of data—has been considerably lightened.

As of 2016, a rough rule of thumb is that a supervised deep learning algorithm:

  • will generally achieve acceptable performance with around 5,000 labeled examples per category,
  • will match or exceed human performance when trained with a dataset containing at least 10 million labeled examples.
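These thresholds are heuristics, but they are concrete enough to encode as a quick sanity check. Below is a minimal Python sketch, not anything from the book: the function name data_regime, the constants, and the returned strings are my own framing, while the numbers (5,000 per category, 10 million overall) come from the rule of thumb above.

```python
# A minimal sketch of the 2016 rule of thumb above; the function name,
# constants, and return strings are illustrative, not from the book.
ACCEPTABLE_PER_CATEGORY = 5_000    # labeled examples per category
HUMAN_LEVEL_TOTAL = 10_000_000     # labeled examples overall

def data_regime(examples_per_category: int, total_examples: int) -> str:
    """Heuristically classify a labeled dataset against the two thresholds."""
    if total_examples >= HUMAN_LEVEL_TOTAL:
        return "large: matching or exceeding human performance is plausible"
    if examples_per_category >= ACCEPTABLE_PER_CATEGORY:
        return "medium: acceptable performance is plausible"
    return "small: consider unsupervised or semi-supervised methods"

# Example: an MNIST-shaped dataset, 10 categories with 6,000 labels each.
print(data_regime(6_000, 60_000))  # -> "medium: acceptable performance ..."
```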

Working successfully with datasets smaller than this is an important research area, focusing in particular on how to take advantage of large quantities of unlabeled examples through unsupervised or semi-supervised learning.

Goodfellow, Bengio, Courville - «Deep Learning» (2016)