The batch size is the number of samples you feed into the model at once.
It matters a great deal during training, and is of secondary importance at test time.
For a standard machine learning / deep learning algorithm, the choice of batch size affects several things:
- The bigger the batch size, the more data you feed into the model at once.
RAM consumption therefore grows roughly linearly with batch size, and there is always a limit, set by your system specs and the size of your model, above which you will run out of memory.
- The bigger the batch size, the faster you complete each of your N passes over the dataset, since each epoch needs fewer updates.
- A bigger batch size slows down the rate at which the model updates: each single update takes longer to compute, because it depends on more data.
- A bigger batch size averages over more data for each update, so training should be smoother: smoother training/test accuracy curves.
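To make the first points concrete, here is a minimal NumPy sketch (the dataset, model, and learning rate are toy choices of mine, not from any specific framework): one epoch of minibatch SGD on a one-parameter linear model, showing that the number of updates per epoch is N / batch_size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: 1024 samples of y = 3x + small noise (illustrative numbers).
X = rng.normal(size=(1024, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=1024)

def train_one_epoch(batch_size, lr=0.1):
    """One epoch of minibatch SGD on y ~ w*x; returns (w, number of updates)."""
    w = 0.0
    n_updates = 0
    for start in range(0, len(X), batch_size):
        xb = X[start:start + batch_size, 0]
        yb = y[start:start + batch_size]
        grad = np.mean(2.0 * (w * xb - yb) * xb)  # gradient averaged over the batch
        w -= lr * grad
        n_updates += 1
    return w, n_updates

w_small, u_small = train_one_epoch(batch_size=8)
w_large, u_large = train_one_epoch(batch_size=256)
print(u_small, u_large)  # 128 4: bigger batches mean far fewer updates per epoch
```

With batch size 256 the epoch takes only 4 updates, each one slower but averaged over more data; with batch size 8 you get 128 cheaper, noisier updates.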
Note that the size of each data sample relates to the batch size only in the sense that the bigger the samples, the smaller the maximum batch size (the limit set by RAM).
The size of the model has a similar relation.
In practice, you should follow “in powers of 2 and the larger the better, provided that the batch fits into your (GPU) memory”.
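In code, that rule of thumb amounts to doubling the batch size until it no longer fits and keeping the last size that did. Here is a sketch with a hypothetical `fits_in_memory` callback (in a real setup it would run one forward/backward pass at that size and catch the framework's out-of-memory error):

```python
def largest_power_of_two_batch(fits_in_memory, start=16, cap=8192):
    """Double the batch size until it no longer fits (or a cap is reached).

    `fits_in_memory` is a hypothetical callback: in practice it would attempt
    one training step at the given batch size and return False on OOM.
    """
    best = None
    bs = start
    while bs <= cap:
        if not fits_in_memory(bs):
            break
        best = bs
        bs *= 2
    return best

# Toy stand-in: pretend anything above 512 samples overflows memory.
print(largest_power_of_two_batch(lambda bs: bs <= 512))  # 512
```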
Minibatch sizes are generally driven by the following factors:
- Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.
- Multicore architectures are usually underutilized by extremely small batches.
This motivates using some absolute minimum batch size, below which there is no reduction in the time to process a minibatch.
- If all examples in the batch are to be processed in parallel (as is typically the case), then the amount of memory scales with the batch size.
For many hardware setups this is the limiting factor in batch size.
- Some kinds of hardware achieve better runtime with specific sizes of arrays. Especially when using GPUs, it is common for power of 2 batch sizes to offer better runtime.
Typical power of 2 batch sizes range from 32 to 256, with 16 sometimes being attempted for large models.
- Small batches can offer a regularizing effect (Wilson and Martinez, 2003), perhaps due to the noise they add to the learning process.
Generalization error is often best for a batch size of 1.
Training with such a small batch size might require a small learning rate to maintain stability because of the high variance in the estimate of the gradient.
The total runtime can be very high as a result of the need to make more steps, both because of the reduced learning rate and because it takes more steps to observe the entire training set.
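The first and last points above (more accurate gradient estimates with less-than-linear returns, noisier small batches) can be checked with a toy simulation; the N(1, 1) per-example "gradients" below are an assumption of mine, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def gradient_estimate_std(batch_size, n_trials=2000):
    """Std-dev of the minibatch mean when per-example 'gradients' are N(1, 1)."""
    batches = rng.normal(loc=1.0, scale=1.0, size=(n_trials, batch_size))
    return float(batches.mean(axis=1).std())

for bs in (1, 16, 256):
    print(bs, gradient_estimate_std(bs))
# The noise shrinks like 1/sqrt(batch_size): a 256x bigger batch gives only
# a 16x more accurate gradient estimate, i.e. less-than-linear returns.
```

The batch-size-1 estimate has roughly unit noise, which is why such training needs a small learning rate to stay stable.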
My experience is exactly the same.
A lower batch size (either nominally lower with the same number of GPUs, or effectively lower because of a smaller number of GPUs) leads to worse final results, even if I train long enough to compensate for the smaller batch.
I remember that people working with Nematus or OpenNMT were surprised by this behavior of Transformer/T2T, because in their experience a lower batch size leads to better results in the end (though slower, of course, which is why they sometimes start training with a big batch and then switch to a smaller one for fine-tuning).