Implementing YOLOv1 from scratch using Keras and TensorFlow 2.0
In this notebook, I implement YOLOv1 as described in the paper You Only Look Once. The goal is to replicate the model from the paper and, in the process, understand the nuances of using Keras on a complex problem.
Data Preprocessing
I use the VOC 2007 dataset because its size is manageable, which makes it easy to run on Google Colab.
First, I download and extract the dataset.
Next, we process the annotations and write the labels to a text file. A text file is easier to consume than XML.
2007 train
2007 val
2007 test
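As a rough illustration of the annotation step, the sketch below parses one VOC-style XML annotation with Python's standard xml.etree.ElementTree module and flattens it into a single text line. The sample XML, the function name annotation_to_line, and the output format (file name followed by xmin,ymin,xmax,ymax,class_id groups) are assumptions made for this example and are not necessarily what the notebook writes.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened VOC-style annotation used only for this sketch.
SAMPLE_XML = """
<annotation>
  <filename>000001.jpg</filename>
  <object>
    <name>dog</name>
    <difficult>0</difficult>
    <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
  </object>
  <object>
    <name>person</name>
    <difficult>0</difficult>
    <bndbox><xmin>8</xmin><ymin>12</ymin><xmax>352</xmax><ymax>498</ymax></bndbox>
  </object>
</annotation>
"""

# The 20 VOC object classes; the position in this list is used as the class id.
CLASSES = [
    "aeroplane", "bicycle", "bird", "boat", "bottle",
    "bus", "car", "cat", "chair", "cow",
    "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]


def annotation_to_line(xml_text):
    """Return 'filename xmin,ymin,xmax,ymax,class_id ...' for one annotation."""
    root = ET.fromstring(xml_text.strip())
    parts = [root.findtext("filename")]
    for obj in root.iter("object"):
        name = obj.findtext("name")
        # Assumption: skip unknown classes and objects flagged as difficult.
        if name not in CLASSES or obj.findtext("difficult") == "1":
            continue
        box = obj.find("bndbox")
        coords = [box.findtext(tag) for tag in ("xmin", "ymin", "xmax", "ymax")]
        parts.append(",".join(coords + [str(CLASSES.index(name))]))
    return " ".join(parts)


line = annotation_to_line(SAMPLE_XML)
print(line)
```

One line per image keeps the later loading code simple: splitting on spaces gives the file name and the boxes, and splitting each box on commas gives its coordinates and class id.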
Next, I add a function to prepare the input and the output. The input is a (448, 448, 3) image and the output is a (7, 7, 30) tensor. The output shape follows S x S x (B * 5 + C).
S x S is the number of grid cells, B is the number of bounding boxes predicted per grid cell, and C is the number of classes. With S = 7, B = 2, and C = 20 (the VOC classes), this gives 7 x 7 x (2 * 5 + 20) = 7 x 7 x 30.
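To make the shape formula concrete, here is a minimal sketch that computes the output shape and shows how one ground-truth box maps to a grid cell. It follows the paper's convention that the box center is expressed relative to its cell and the width and height relative to the whole image. The function names output_shape and encode_box, and the assumption that box coordinates are already given in pixels of the resized 448 x 448 image, are mine and not taken from the notebook.

```python
S, B, C = 7, 2, 20   # grid size, boxes per cell, number of VOC classes
IMAGE_SIZE = 448     # side length of the resized square input image


def output_shape(s=S, b=B, c=C):
    """Shape of the target tensor: S x S x (B * 5 + C)."""
    return (s, s, b * 5 + c)


def encode_box(xmin, ymin, xmax, ymax, image_size=IMAGE_SIZE, s=S):
    """Map a pixel box to (row, col, x_offset, y_offset, w, h)."""
    # Normalize the center and size to the [0, 1] range.
    x_center = (xmin + xmax) / 2 / image_size
    y_center = (ymin + ymax) / 2 / image_size
    w = (xmax - xmin) / image_size
    h = (ymax - ymin) / image_size
    # The cell containing the center is responsible for the object.
    col = min(int(x_center * s), s - 1)
    row = min(int(y_center * s), s - 1)
    # Offsets of the center inside that cell, each in [0, 1).
    x_offset = x_center * s - col
    y_offset = y_center * s - row
    return row, col, x_offset, y_offset, w, h


print(output_shape())
row, col, x_off, y_off, w, h = encode_box(128, 192, 320, 384)
print(row, col)
```

How the five box values and the class vector are ordered inside the 30 channels is an implementation choice; the loss function and the label encoder only need to agree on it.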
Training the model
Next, I define a custom generator that returns batches of inputs and outputs.
The code snippet below prepares arrays with inputs and outputs.
Next, we create instances of the generator for our training and validation sets.
(4, 448, 448, 3)
(4, 7, 7, 30)
(4, 448, 448, 3)
(4, 7, 7, 30)
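The shapes above correspond to batches of four images and four label tensors. The sketch below illustrates the indexing logic behind such a batch generator using plain Python lists. In a Keras project this kind of class would typically subclass keras.utils.Sequence and return NumPy arrays; the class name BatchGenerator, the load_fn parameter, and the toy data are assumptions for illustration only.

```python
import math


class BatchGenerator:
    """Sequence-style provider: len() is batches per epoch, [i] is one batch."""

    def __init__(self, samples, labels, batch_size, load_fn):
        if len(samples) != len(labels):
            raise ValueError("samples and labels must have the same length")
        self.samples = samples
        self.labels = labels
        self.batch_size = batch_size
        # Hypothetical hook that turns one (sample, label) pair into (x, y),
        # e.g. reading an image and building its 7 x 7 x 30 target.
        self.load_fn = load_fn

    def __len__(self):
        # The last batch may be smaller than batch_size.
        return math.ceil(len(self.samples) / self.batch_size)

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        start = idx * self.batch_size
        stop = start + self.batch_size
        pairs = [
            self.load_fn(sample, label)
            for sample, label in zip(self.samples[start:stop], self.labels[start:stop])
        ]
        xs = [x for x, _ in pairs]
        ys = [y for _, y in pairs]
        return xs, ys


# Toy data: strings stand in for image paths and label rows.
paths = [f"img_{i}.jpg" for i in range(10)]
rows = [f"label_{i}" for i in range(10)]
gen = BatchGenerator(paths, rows, batch_size=4, load_fn=lambda p, r: (p.upper(), r.upper()))

x_batch, y_batch = gen[2]
print(len(gen), x_batch, y_batch)
```

Loading data lazily per batch keeps memory usage low, which matters on Colab where the full set of 448 x 448 images would not comfortably fit in memory at once.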
Define a custom output layer
The model's final dense layer produces a flat vector of 1470 values, so we define a custom Keras layer that reshapes it into the (7, 7, 30) output.
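The arithmetic behind the reshape is 7 * 7 * 30 = 1470. The sketch below shows only that index mapping on a plain Python list, in row-major order. It is an illustration under my own assumptions: the notebook's layer operates on tensors and may also reorder channels or apply activations, which this sketch does not attempt to reproduce. The name reshape_flat is hypothetical.

```python
S, DEPTH = 7, 30  # grid size and values per grid cell


def reshape_flat(values, s=S, depth=DEPTH):
    """Reshape a flat list of s * s * depth values into [s][s][depth]."""
    if len(values) != s * s * depth:
        raise ValueError(f"expected {s * s * depth} values, got {len(values)}")
    return [
        [values[(r * s + c) * depth:(r * s + c + 1) * depth] for c in range(s)]
        for r in range(s)
    ]


flat = list(range(S * S * DEPTH))  # stand-in for the 1470 dense outputs
grid = reshape_flat(flat)
print(len(grid), len(grid[0]), len(grid[0][0]))
```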
Define the YOLO model
Next, we define the model as described in the original paper.
Model: "sequential_5"
Layer (type) Output Shape Param #
conv2d_120 (Conv2D) (None, 448, 448, 64) 9472
max_pooling2d_20 (MaxPooling (None, 224, 224, 64) 0
conv2d_121 (Conv2D) (None, 224, 224, 192) 110784
max_pooling2d_21 (MaxPooling (None, 112, 112, 192) 0
conv2d_122 (Conv2D) (None, 112, 112, 128) 24704
conv2d_123 (Conv2D) (None, 112, 112, 256) 295168
conv2d_124 (Conv2D) (None, 112, 112, 256) 65792
conv2d_125 (Conv2D) (None, 112, 112, 512) 1180160
max_pooling2d_22 (MaxPooling (None, 56, 56, 512) 0
conv2d_126 (Conv2D) (None, 56, 56, 256) 131328
conv2d_127 (Conv2D) (None, 56, 56, 512) 1180160
conv2d_128 (Conv2D) (None, 56, 56, 256) 131328
conv2d_129 (Conv2D) (None, 56, 56, 512) 1180160
conv2d_130 (Conv2D) (None, 56, 56, 256) 131328
conv2d_131 (Conv2D) (None, 56, 56, 512) 1180160
conv2d_132 (Conv2D) (None, 56, 56, 256) 131328
conv2d_133 (Conv2D) (None, 56, 56, 512) 1180160
conv2d_134 (Conv2D) (None, 56, 56, 512) 262656
conv2d_135 (Conv2D) (None, 56, 56, 1024) 4719616
max_pooling2d_23 (MaxPooling (None, 28, 28, 1024) 0
conv2d_136 (Conv2D) (None, 28, 28, 512) 524800
conv2d_137 (Conv2D) (None, 28, 28, 1024) 4719616
conv2d_138 (Conv2D) (None, 28, 28, 512) 524800
conv2d_139 (Conv2D) (None, 28, 28, 1024) 4719616
conv2d_140 (Conv2D) (None, 28, 28, 1024) 9438208
conv2d_141 (Conv2D) (None, 14, 14, 1024) 9438208
conv2d_142 (Conv2D) (None, 12, 12, 1024) 9438208
conv2d_143 (Conv2D) (None, 10, 10, 1024) 9438208
flatten_5 (Flatten) (None, 102400) 0
dense_15 (Dense) (None, 512) 52429312
dense_16 (Dense) (None, 1024) 525312
dropout_5 (Dropout) (None, 1024) 0
dense_17 (Dense) (None, 1470) 1506750
yolo__reshape_10 (Yolo_Resha (None, 7, 7, 30) 0
Total params: 114,617,342
Trainable params: 114,617,342
Non-trainable params: 0
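The parameter counts in the summary can be checked by hand. A convolution with a k x k kernel has (k * k * input_channels + 1) * filters parameters, and a dense layer has (inputs + 1) * units, where the + 1 accounts for the bias. The kernel sizes used below (7, 3, and 1) are inferred from the counts in the summary and match the paper; the helper names are mine.

```python
def conv2d_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a square-kernel Conv2D layer."""
    return (kernel * kernel * in_channels + 1) * filters


def dense_params(inputs, units):
    """Weights plus one bias per unit for a Dense layer."""
    return (inputs + 1) * units


# First three convolutions in the summary.
print(conv2d_params(7, 3, 64))      # conv2d_120
print(conv2d_params(3, 64, 192))    # conv2d_121
print(conv2d_params(1, 192, 128))   # conv2d_122

# Dense layers: the flattened 10 x 10 x 1024 feature map feeds dense_15.
print(dense_params(10 * 10 * 1024, 512))  # dense_15
print(dense_params(1024, 1470))           # dense_17
```

The first dense layer alone accounts for roughly 52 million of the 114.6 million parameters, because it connects all 102,400 flattened features to 512 units.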
Define a custom learning rate scheduler
The paper uses different learning rates at different stages of training, so we define a custom callback that sets the learning rate at the start of each epoch.
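A minimal sketch of the epoch-to-rate mapping such a callback needs is shown below. The paper trains for about 135 epochs: 0.01 for 75 epochs, then 0.001 for 30, then 0.0001 for 30, after a short warm-up from 0.001. The schedule table and the name lr_for_epoch are assumptions for this example, and the warm-up is left out here.

```python
# Assumed schedule as (first_epoch, learning_rate) pairs, sorted by epoch.
LR_SCHEDULE = [(0, 0.01), (75, 0.001), (105, 0.0001)]


def lr_for_epoch(epoch, schedule=LR_SCHEDULE):
    """Return the learning rate of the last stage that has started."""
    lr = schedule[0][1]
    for first_epoch, value in schedule:
        if epoch >= first_epoch:
            lr = value
        else:
            break
    return lr


for epoch in (0, 74, 75, 104, 105, 134):
    print(f"Epoch {epoch:05d}: Learning rate is {lr_for_epoch(epoch):.4f}.")
```

A callback would call a function like this at the beginning of every epoch and assign the result to the optimizer's learning rate.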
Define the loss function
Next, we define a custom loss function for the model. Take a look at this blog post to understand more about the loss function used in YOLO.
I understood the loss function but didn't implement it on my own. I took the implementation as is from this GitHub repo.
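One building block of the YOLO loss is the intersection over union (IoU) between a predicted box and the ground-truth box. In the paper, the predictor with the highest IoU in a cell is the one held responsible for the object. The sketch below computes IoU for two corner-format boxes in plain Python. It is only an illustration of the concept and is not the borrowed implementation, which works on batched tensors; the function name iou is mine.

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    # Corners of the overlapping rectangle, if any.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes give no intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
print(iou((0, 0, 2, 2), (0, 0, 2, 2)))
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))
```

The paper's loss also weights the coordinate terms by 5 and the confidence terms of cells without objects by 0.5, so that the many empty cells do not dominate the gradient.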
Add a callback for saving the weights
Next, I define a callback that saves the best weights seen so far during training.
Compile the model
Finally, I compile the model using the custom loss function that was defined above.
Train the model
Now that everything is set up, we train the model for 135 epochs.
Epoch 00000: Learning rate is 0.0100.
Epoch 1/135
625/625 [==============================] - 195s 311ms/step - loss: 88.0331 - val_loss: 245.3397
Epoch 00001: Learning rate is 0.0100.
Epoch 2/135
625/625 [==============================] - 194s 310ms/step - loss: 140.9500 - val_loss: 116.6240
Epoch 00002: Learning rate is 0.0100.
Epoch 3/135
625/625 [==============================] - 194s 310ms/step - loss: 114.1760 - val_loss: 113.2524
Epoch 00003: Learning rate is 0.0100.
Epoch 4/135
625/625 [==============================] - 194s 310ms/step - loss: 113.0043 - val_loss: 112.8592
Epoch 00004: Learning rate is 0.0100.
Epoch 5/135
625/625 [==============================] - 189s 303ms/step - loss: 112.9847 - val_loss: 113.3475
Epoch 00005: Learning rate is 0.0100.
Epoch 6/135
625/625 [==============================] - 194s 310ms/step - loss: 113.0094 - val_loss: 112.7520
Epoch 00006: Learning rate is 0.0100.
Epoch 7/135
625/625 [==============================] - 194s 310ms/step - loss: 71.0617 - val_loss: 61.3470
Implementing YOLOv1 from scratch was a good exercise for understanding the nuances of building a model end to end. This implementation won't reach the accuracy reported in the paper because we skipped the pretraining step.