Mastering Object Detection with YOLOv3 and COCO dataset

Mar 16, 2023

In this blog post, I explain, line by line, the code for object detection with a pre-trained YOLOv3 model on the COCO dataset, which has 80 class labels. You can get the weights and cfg files from the official YOLO website: https://pjreddie.com/darknet/yolo/




image = cv2.imread('./testing images/crosswalk-featured.jpg')
#cv2.imshow('image',image)
#cv2.waitKey()
#cv2.destroyAllWindows()
original_with , original_height = image.shape[1] , image.shape[0]

Neural_Network = cv2.dnn.readNetFromDarknet('./Files/yolov3.cfg','./Files/yolov3.weights')
classes_names = []
# Read one class name per line; the with-block closes the file afterwards
with open('./Files/class_names','r') as k:
    for i in k.readlines():
        classes_names.append(i.strip())
#print(classes_names)
# The 4th positional parameter of blobFromImage is mean, so swapRB is passed by keyword
blob = cv2.dnn.blobFromImage(image , 1/255 , (320,320) , swapRB = True , crop = False)
#print(blob.shape)
Neural_Network.setInput(blob)
cfg_data = Neural_Network.getLayerNames()
#print(cfg_data)
layer_names = Neural_Network.getUnconnectedOutLayers()
# Indices are 1-based; flatten() also handles OpenCV versions that return an Nx1 array
outputs = [cfg_data[i-1] for i in np.array(layer_names).flatten()]
#print(outputs)
output_data = Neural_Network.forward(outputs)
prediction_box , bounding_box , confidence , class_labels = bounding_box_prediction(output_data)
final_prediction(prediction_box , bounding_box , confidence , class_labels , original_with / 320 , original_height / 320 )

The first line reads the image file crosswalk-featured.jpg from the testing images directory and stores it as a NumPy array in the variable image.
The next three commented-out lines would display the image in an OpenCV window.
The next line reads the image's width (image.shape[1]) and height (image.shape[0]) into the variables original_with and original_height, respectively.
The line cv2.dnn.readNetFromDarknet('./Files/yolov3.cfg','./Files/yolov3.weights') loads the pre-trained YOLOv3 model, which is stored in the Darknet format. The two arguments are the paths to the configuration file and the weights file, respectively.
The next lines read the COCO class names from the file class_names, one per line, and store them in the list classes_names.
The cv2.dnn.blobFromImage() function creates a 4-dimensional blob (batch, channels, height, width) from the input image, which is the format the network expects as input. The call needs the input image, the scale factor (1/255, which maps pixel values to the range 0 to 1), the network input size (320 x 320), the swapRB flag (True, to convert OpenCV's BGR channel order to RGB), and crop=False (resize without cropping). Note that the fourth positional parameter of blobFromImage() is the mean-subtraction value, not swapRB, so swapRB should be passed by keyword.
The setInput() function sets the blob as the input to the network.
The getLayerNames() function returns the names of all layers in the network.
The getUnconnectedOutLayers() function returns the indices of the layers whose outputs are not consumed by any other layer, that is, the network's output layers. The indices are 1-based, which is why the code subtracts 1 when looking up the layer names. In YOLOv3 these are the three detection layers, which OpenCV names yolo_82, yolo_94, and yolo_106 after their positions in the cfg file.
The forward() function performs a forward pass of the network and returns the predictions for the given input blob. The outputs variable holds the names of the output layers, and output_data is the list of arrays they produce, one per detection scale.
The bounding_box_prediction() function is then called. It extracts bounding box coordinates, class labels, and confidence scores from the predictions, keeps only detections above the confidence threshold, and removes overlapping duplicates with non-maximum suppression, which is based on IoU (Intersection over Union).
The final_prediction() function is called to draw the predicted bounding boxes on the input image, along with the predicted class label and confidence score. The values original_with / 320 and original_height / 320 are the scaling factors that convert box coordinates from the 320 x 320 network input back to the original image size.
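
To make the blobFromImage() step described above more concrete, here is a rough NumPy sketch of the same transformation. It is only an approximation under stated assumptions: the helper name to_blob is hypothetical, the image is assumed to be already resized to 320 x 320 (OpenCV does the resize itself), and no mean subtraction is applied.

```python
import numpy as np

def to_blob(image_bgr, scale=1 / 255):
    """Approximate cv2.dnn.blobFromImage for an already-resized BGR image."""
    rgb = image_bgr[:, :, ::-1]                # swapRB: BGR -> RGB
    scaled = rgb.astype(np.float32) * scale    # map 0..255 to roughly 0..1
    chw = np.transpose(scaled, (2, 0, 1))      # height, width, channel -> channel, height, width
    return chw[np.newaxis, ...]                # add the batch dimension

# Hypothetical 320x320 test image that is pure blue in OpenCV's BGR order
image = np.zeros((320, 320, 3), dtype=np.uint8)
image[:, :, 0] = 255

blob = to_blob(image)
print(blob.shape)  # (1, 3, 320, 320)
```

After the channel swap, the blue values end up in the last channel of the blob, which is what a network trained on RGB images expects.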


def bounding_box_prediction(output_data):
    bounding_box = []
    class_labels = []
    confidence_score = []
    for i in output_data:      # one array per YOLO output layer
        for j in i:            # one row per candidate box: cx, cy, w, h, objectness, 80 class scores
            high_label = j[5:]
            classes_ids = np.argmax(high_label)
            confidence = high_label[classes_ids]

            if confidence > Threshold:
                # Coordinates are relative to the network input, so scale them to pixels
                w , h = int(j[2] * image_size) , int(j[3] * image_size)
                # Convert the box centre to its top-left corner
                x , y = int(j[0] * image_size - w/2) , int(j[1] * image_size - h/2)
                bounding_box.append([x,y,w,h])
                class_labels.append(classes_ids)
                confidence_score.append(float(confidence))

    # Non-maximum suppression: score threshold = Threshold, IoU threshold = 0.6
    prediction_boxes = cv2.dnn.NMSBoxes(bounding_box , confidence_score , Threshold , .6)
    return prediction_boxes , bounding_box ,confidence_score,class_labels

The function bounding_box_prediction() collects the bounding box, class label, and confidence score of each detected object. It takes output_data (the output of the YOLOv3 network) as its argument. Each row of an output array describes one candidate box: the first four values are the box centre, width, and height relative to the network input, the fifth is the objectness score, and the remaining 80 values are the class scores. The function uses argmax() to find the index of the highest class score and treats that score as the confidence. If the confidence is higher than Threshold, the box coordinates, class label, and confidence score are appended to their respective lists. Finally, non-maximum suppression is applied to the collected boxes with cv2.dnn.NMSBoxes, which returns the indices of the boxes to keep.
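
The non-maximum suppression step can be illustrated with a small pure-Python sketch. This is a simplified greedy version written for explanation, not OpenCV's implementation; the function names iou and greedy_nms and the sample boxes are hypothetical. Boxes use the same [x, y, w, h] layout as above.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as [x, y, w, h]."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    inter_w = max(0, min(ax2, bx2) - max(a[0], b[0]))
    inter_h = max(0, min(ay2, by2) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def greedy_nms(boxes, scores, score_threshold, nms_threshold):
    """Return the indices of the kept boxes, highest score first."""
    # Drop low-scoring boxes, then visit the rest from best to worst
    order = sorted(
        (i for i, s in enumerate(scores) if s > score_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    kept = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box
        if all(iou(boxes[i], boxes[k]) <= nms_threshold for k in kept):
            kept.append(i)
    return kept

# Two heavily overlapping boxes and one separate box (made-up values)
boxes = [[100, 100, 50, 80], [104, 102, 50, 80], [300, 120, 40, 40]]
scores = [0.90, 0.75, 0.60]
kept = greedy_nms(boxes, scores, score_threshold=0.5, nms_threshold=0.6)
print(kept)  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.81, which is above the 0.6 limit, so only the higher-scoring of the two survives.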

import cv2
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 


font = cv2.FONT_HERSHEY_COMPLEX

Threshold = 0.5
image_size = 320


def final_prediction(prediction_box , bounding_box , confidence , class_labels,width_ratio,height_ratio):
    # np.array(...) also covers the case where NMS keeps nothing and returns an empty tuple
    for j in np.array(prediction_box).flatten():
        x, y , w , h = bounding_box[j]
        # Scale from the 320x320 network input back to the original image size
        x = int(x * width_ratio)
        y = int(y * height_ratio)
        w = int(w * width_ratio)
        h = int(h * height_ratio)

        label = str(classes_names[class_labels[j]])
        conf_ = str(round(confidence[j],2))
        # Red box (BGR) with the label and score in green just above it
        cv2.rectangle(image , (x,y) , (x+w , y+h) , (0,0,255) , 2)
        cv2.putText(image , label+' '+conf_ , (x , y-2) , font , .2 , (0,255,0),1)

This section imports NumPy, Pandas, and Matplotlib; the script also depends on OpenCV (cv2), which must be imported as well. Of these, only NumPy and OpenCV are actually used by the code shown. It also sets the font used to draw the label and confidence score of each detected object. The Threshold variable is the minimum confidence score a detection must exceed to be kept, while image_size is the side length, in pixels, of the square input passed to the network (320).
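
As an illustration of how Threshold and image_size are applied, the sketch below decodes a single detection row with the same arithmetic as bounding_box_prediction(). The helper name decode_row and the sample row are hypothetical, and the row carries only 3 class scores instead of 80 to keep it short.

```python
import numpy as np

THRESHOLD = 0.5    # same role as Threshold in the post
IMAGE_SIZE = 320   # same role as image_size in the post

def decode_row(row, threshold=THRESHOLD, image_size=IMAGE_SIZE):
    """Return ([x, y, w, h], class_id, score), or None if the score is too low."""
    class_scores = row[5:]
    class_id = int(np.argmax(class_scores))
    score = float(class_scores[class_id])
    if score <= threshold:
        return None
    # Width and height in pixels of the network input
    w, h = int(row[2] * image_size), int(row[3] * image_size)
    # Convert the box centre to its top-left corner
    x, y = int(row[0] * image_size - w / 2), int(row[1] * image_size - h / 2)
    return [x, y, w, h], class_id, score

# Layout: cx, cy, w, h, objectness, then the class scores (made-up values)
row = np.array([0.5, 0.5, 0.25, 0.5, 0.9, 0.05, 0.8, 0.1], dtype=np.float32)
box, class_id, score = decode_row(row)
print(box, class_id)  # [120, 80, 80, 160] 1

# A row whose best class score is below the threshold is discarded
weak_row = np.array([0.5, 0.5, 0.25, 0.5, 0.9, 0.05, 0.3, 0.1], dtype=np.float32)
print(decode_row(weak_row))  # None
```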

The final_prediction() function draws a bounding box around each detected object and writes its label and confidence score next to it. Its arguments are prediction_box (the indices kept by the NMS algorithm), bounding_box (the coordinates of all candidate boxes), confidence (their confidence scores), class_labels (their class indices), width_ratio (the ratio of the original image width to the network input width), and height_ratio (the ratio of the original image height to the network input height).
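
The rescaling inside final_prediction() is simple arithmetic, shown here as a worked example. The helper name scale_box, the 1280 x 720 image size, and the sample box are hypothetical values chosen for illustration.

```python
def scale_box(box, width_ratio, height_ratio):
    """Map an [x, y, w, h] box from network-input pixels to original-image pixels."""
    x, y, w, h = box
    return [int(x * width_ratio), int(y * height_ratio),
            int(w * width_ratio), int(h * height_ratio)]

original_width, original_height = 1280, 720   # assumed size of the original image
network_size = 320                            # side length of the network input

box_320 = [120, 80, 80, 160]                  # box in 320x320 coordinates
box_original = scale_box(
    box_320,
    original_width / network_size,    # width_ratio  = 4.0
    original_height / network_size,   # height_ratio = 2.25
)
print(box_original)  # [480, 180, 320, 360]
```

Because the width and height ratios differ whenever the original image is not square, x and w are scaled by one factor and y and h by the other.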

Test Image:

Complete code:

import cv2
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 


font = cv2.FONT_HERSHEY_COMPLEX

Threshold = 0.5
image_size = 320


def final_prediction(prediction_box , bounding_box , confidence , class_labels,width_ratio,height_ratio):
    # np.array(...) also covers the case where NMS keeps nothing and returns an empty tuple
    for j in np.array(prediction_box).flatten():
        x, y , w , h = bounding_box[j]
        # Scale from the 320x320 network input back to the original image size
        x = int(x * width_ratio)
        y = int(y * height_ratio)
        w = int(w * width_ratio)
        h = int(h * height_ratio)

        label = str(classes_names[class_labels[j]])
        conf_ = str(round(confidence[j],2))
        # Red box (BGR) with the label and score in green just above it
        cv2.rectangle(image , (x,y) , (x+w , y+h) , (0,0,255) , 2)
        cv2.putText(image , label+' '+conf_ , (x , y-2) , font , .2 , (0,255,0),1)

def bounding_box_prediction(output_data):
    bounding_box = []
    class_labels = []
    confidence_score = []
    for i in output_data:      # one array per YOLO output layer
        for j in i:            # one row per candidate box: cx, cy, w, h, objectness, 80 class scores
            high_label = j[5:]
            classes_ids = np.argmax(high_label)
            confidence = high_label[classes_ids]

            if confidence > Threshold:
                # Coordinates are relative to the network input, so scale them to pixels
                w , h = int(j[2] * image_size) , int(j[3] * image_size)
                # Convert the box centre to its top-left corner
                x , y = int(j[0] * image_size - w/2) , int(j[1] * image_size - h/2)
                bounding_box.append([x,y,w,h])
                class_labels.append(classes_ids)
                confidence_score.append(float(confidence))

    # Non-maximum suppression: score threshold = Threshold, IoU threshold = 0.6
    prediction_boxes = cv2.dnn.NMSBoxes(bounding_box , confidence_score , Threshold , .6)
    return prediction_boxes , bounding_box ,confidence_score,class_labels


image = cv2.imread('./testing images/crosswalk-featured.jpg')
#cv2.imshow('image',image)
#cv2.waitKey()
#cv2.destroyAllWindows()
original_with , original_height = image.shape[1] , image.shape[0]

Neural_Network = cv2.dnn.readNetFromDarknet('./Files/yolov3.cfg','./Files/yolov3.weights')
classes_names = []
# Read one class name per line; the with-block closes the file afterwards
with open('./Files/class_names','r') as k:
    for i in k.readlines():
        classes_names.append(i.strip())
#print(classes_names)
# The 4th positional parameter of blobFromImage is mean, so swapRB is passed by keyword
blob = cv2.dnn.blobFromImage(image , 1/255 , (320,320) , swapRB = True , crop = False)
#print(blob.shape)
Neural_Network.setInput(blob)
cfg_data = Neural_Network.getLayerNames()
#print(cfg_data)
layer_names = Neural_Network.getUnconnectedOutLayers()
# Indices are 1-based; flatten() also handles OpenCV versions that return an Nx1 array
outputs = [cfg_data[i-1] for i in np.array(layer_names).flatten()]
#print(outputs)
output_data = Neural_Network.forward(outputs)
prediction_box , bounding_box , confidence , class_labels = bounding_box_prediction(output_data)
final_prediction(prediction_box , bounding_box , confidence , class_labels , original_with / 320 , original_height / 320 )

Yolov3 Detection:

You can find a complete explanation of the YOLOv3 architecture here: https://medium.com/p/74cf9ade2044

LinkedIn: https://www.linkedin.com/feed/

Computer vision Blogs: https://medium.com/me/stories/public


Written by kamal_DS