Training deep models is the most time-consuming part of the deep learning development journey, both in configuration effort and in computational cost. Having gone through the struggle ourselves, we list below a few lessons that we hope will make your experience much smoother!
1. Start Simple
When training your deep model, the choice of hyperparameters (batch size, learning rate...) is important!
If the task/dataset is new, or the model/approach is significantly different from prior work, it is best to start with simple choices: a reasonable batch size (e.g. 16 or 32), a small learning rate (e.g. 1e-5), and light data augmentation (e.g. image flipping, rotation). Once your model outputs something meaningful, you can move on to better choices.
If you have a certain task (e.g. semantic segmentation) and a complex architecture in mind, don't train that complex model right away. Instead, try a simple vanilla architecture first. You may be surprised that the simple architecture works quite well and only needs minor tweaks to boost performance.
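To make this concrete, here is a minimal sketch of such a simple starting point, assuming a PyTorch image-classification setup; `SimpleCNN` and the `FakeData` stand-in dataset are illustrative placeholders for your own baseline and data.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# A deliberately vanilla model: get this training cleanly before
# moving to the complex architecture you actually have in mind.
class SimpleCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, num_classes)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

# Placeholder data; replace with your own Dataset.
train_dataset = datasets.FakeData(size=256, image_size=(3, 64, 64),
                                  num_classes=10,
                                  transform=transforms.ToTensor())

# Simple first choices: reasonable batch size, small learning rate.
loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
model = SimpleCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
criterion = nn.CrossEntropyLoss()
```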
2. Don't Reinvent the Wheel
- If you are working on a public benchmark or a common task (e.g. image classification using ResNet), it is best to follow others: pick one or two good papers and follow the same choices, especially if you are improving an existing model.
- If you are already familiar with a problem, you can use fancier hyperparameters/tricks (e.g. advanced data augmentation, a different training paradigm, better model initialization...).
- To get familiar with recent "training tricks", one way is to pick a recent SOTA model in image classification (e.g. a vision transformer) and check its implementation details.
- When a new trick that is said to boost performance is not doing so for you, make sure you are following the authors' recommendations. Example: in the Swish paper (https://arxiv.org/pdf/1710.05941.pdf) the authors state: "For training Swish networks, we found that slightly lowering the learning rate used to train ReLU networks works well."
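As a minimal sketch of applying that kind of recommendation, here Swish (shipped in PyTorch as `nn.SiLU`) replaces ReLU and the learning rate is lowered relative to the ReLU baseline; the halving factor is our own illustrative assumption, not a value from the paper.

```python
import torch
import torch.nn as nn

# Swish is f(x) = x * sigmoid(x); PyTorch ships it as nn.SiLU.
relu_block = nn.Sequential(nn.Linear(128, 128), nn.ReLU())
swish_block = nn.Sequential(nn.Linear(128, 128), nn.SiLU())

# Following the paper's advice: slightly lower the LR that worked for ReLU.
relu_lr = 1e-3            # LR you already validated for the ReLU network
swish_lr = relu_lr * 0.5  # illustrative assumption, tune for your setup
optimizer = torch.optim.SGD(swish_block.parameters(), lr=swish_lr)
```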
3. Familiarize yourself with repositories you like
There are a lot of repositories out there; you need to explore and see what you like and what works best for you and your model. Examples: ( https://github.com/albumentations-team/albumentations - https://github.com/scikit-image/scikit-image - https://github.com/geopandas/geopandas - https://github.com/rasterio/rasterio ...)
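For a quick taste of one of these, here is a short sketch reading a GeoTIFF with rasterio into a NumPy array ready for a model; the file path is a placeholder.

```python
import rasterio

# Read a raster (e.g. a satellite scene) as a (bands, height, width) array.
with rasterio.open("scene.tif") as src:   # placeholder path
    image = src.read()
    print(src.crs, src.res, image.shape)  # coordinate system, resolution, shape
```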
4. Make sure you use Data Augmentation wisely
- Random data augmentation should be used for training only, to improve the robustness of the model. For val or test, do not use random augmentation, so that metrics are computed fairly at every epoch. For test and val, stick to resizing and normalization, or other preprocessing your method requires (see the pipeline sketch after this list).
- You can still apply augmentations like flips and rotations at test and val time, but you should calculate the scores separately for each of the original/flipped/rotated images (see the TTA sketch after this list).
- Extensive data augmentation can be harmful to your model. Be careful with the percentage, intensity, and type of the augmentations you add:
  - [Percentage] Noise addition can help build a more robust model, but you can't apply noise to 50% of your training set; 5%-15% should be enough.
  - [Intensity] RGB shift is a common color augmentation, but keep the shift to a reasonable value or you will change the content of the image.
  - [Type] It is unreasonable to use some color augmentations when classifying crop types, where color is a major key to identifying the crop type.
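Putting the bullets on train-only randomness and moderation together, here is a sketch using albumentations (listed above); the specific probabilities and limits are illustrative assumptions you should tune.

```python
import albumentations as A
from albumentations.pytorch import ToTensorV2

# Training: random augmentation with modest probabilities and intensities.
train_tf = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.Rotate(limit=15, p=0.3),                # small rotations only
    A.RGBShift(r_shift_limit=10, g_shift_limit=10,
               b_shift_limit=10, p=0.2),      # keep the shift reasonable
    A.GaussNoise(p=0.1),                      # noise on ~10% of samples
    A.Resize(256, 256),
    A.Normalize(),
    ToTensorV2(),
])

# Val/test: deterministic preprocessing only, for fair metrics every epoch.
val_tf = A.Compose([
    A.Resize(256, 256),
    A.Normalize(),
    ToTensorV2(),
])
```

Usage is `train_tf(image=img)["image"]` on a NumPy HWC image.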
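And for the flips/rotations-at-test-time bullet, a minimal test-time augmentation sketch that keeps each variant's predictions separate so you can score them individually; `model` and `images` are placeholders.

```python
import torch

def tta_predictions(model, images):
    """Return per-variant predictions; compute your metric on each separately."""
    variants = {
        "original": images,
        "hflip": torch.flip(images, dims=[-1]),            # horizontal flip
        "rot90": torch.rot90(images, k=1, dims=[-2, -1]),  # 90-degree rotation
    }
    with torch.no_grad():
        return {name: model(batch) for name, batch in variants.items()}
```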
5. Watch your Pre-Trained Weights
- When training a model, first try pre-trained weights as initialization (ImageNet, COCO...) instead of random initialization such as Xavier or normal initialization. This may seem simple, but it often greatly affects training speed and convergence.
- Sometimes the training data for your task is very different from the data on which the pre-trained weights were trained. Example: training a segmentation model on low-resolution satellite SAR images when the pre-trained model was trained on regular high-resolution RGB images. In this case, training from pre-trained weights might drive the model into an early local minimum, and it might be better to start with random initialization.
- A lower initial learning rate (e.g. 0.0001) often works well when using pre-trained weights. When using random initialization, you might need to start with a higher learning rate (0.001 - 0.01). Both paths are sketched below.
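The bullets above combine into something like the following sketch, assuming torchvision >= 0.13; ResNet-50 and the exact learning rates are illustrative choices, not prescriptions.

```python
import torch
import torch.nn as nn
from torchvision import models

# Path 1: pre-trained (ImageNet) initialization + lower starting LR.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

# Path 2: random (Xavier) initialization + higher starting LR, e.g. when
# your data (such as low-resolution SAR imagery) is far from the
# pre-training data.
model = models.resnet50(weights=None)

def xavier_init(m):
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

model.apply(xavier_init)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
```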