Real-Time Object Detection with YOLO and RealSense Depth Camera

Introduction

This project implements real-time object detection and segmentation using the YOLO (You Only Look Once) model, integrated with a RealSense camera. The script captures video frames from the RealSense camera, applies object detection, overlays segmentation masks, and visualizes the results in real time.

Object Detection

Object detection is a computer vision technique that identifies and locates objects within an image or video frame. This process involves both classifying objects and determining their positions, typically marked with bounding boxes.

RealSense D415 Depth Camera


RealSense is a series of depth-sensing cameras developed by Intel, designed to capture 3D spatial data and enable depth perception in various applications, ranging from virtual and augmented reality to robotics and gesture recognition.


Industrial Use Cases

  • Quality Control
  • Inventory Management
  • Safety Compliance Monitoring
  • Automation in Manufacturing
  • Robotics and Automated Guided Vehicles (AGVs)
  • Agricultural Automation
  • Surveillance and Security
  • Pharmaceuticals and Healthcare
  • Food and Beverage Industry
  • Mining and Construction

Machine Learning Models

Machine learning object detection models identify and locate objects in images or videos. They use algorithms, typically based on convolutional neural networks, to classify objects and pinpoint their positions with bounding boxes. Popular models like YOLO, SSD, and Faster R-CNN vary in speed and accuracy and are essential in applications like autonomous driving, surveillance, and augmented reality.
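
As a hedged illustration of how such a model is used in practice, the following minimal sketch runs a pretrained YOLOv8 detector on a single image with the Ultralytics API (the weights file and sample image URL are Ultralytics' published defaults; this cell is not part of the project's main script):

In [ ]:
from ultralytics import YOLO

# Load pretrained COCO detection weights (downloaded automatically on first use)
model = YOLO("yolov8n.pt")

# Run detection on a sample image
results = model("https://ultralytics.com/images/bus.jpg")

# Each detection carries a class id, a confidence score, and box corners
for r in results:
    for box in r.boxes:
        print(model.names[int(box.cls)], float(box.conf), box.xyxy.tolist())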

Model Comparison

Faster R-CNN, YOLO, and SSD are three popular object detection systems that use deep learning to locate and classify objects in images. They differ in their architectures, speed, and accuracy. Here is a brief comparison of their main features:

Faster R-CNN:
This system consists of two modules: a region proposal network (RPN) that generates candidate regions of interest (RoIs), and a Fast R-CNN network that classifies and refines the RoIs. Faster R-CNN is accurate and robust, but it is slow compared to the other two systems, as it requires multiple stages and computations.
YOLO:
This system divides the input image into a grid of cells, and predicts bounding boxes and class probabilities for each cell. YOLO is fast and efficient, as it performs object detection in a single pass through the network. However, it may struggle with small or overlapping objects, as it has a limited number of bounding boxes per cell.
SSD:
This system also performs object detection in a single pass, but it uses multiple feature maps of different resolutions to generate bounding boxes and class probabilities. SSD is faster than Faster R-CNN and more accurate than YOLO, as it can detect objects of various sizes and shapes. However, it may still miss some small or occluded objects, as it relies on fixed aspect ratios and scales.

In summary, Faster R-CNN is suitable for applications that require high accuracy and can tolerate low speed, such as medical image analysis or autonomous driving. YOLO is suitable for applications that require real-time performance and can tolerate some errors, such as video surveillance or sports analysis. SSD is a good compromise between speed and accuracy and can be used for general-purpose object detection tasks.

YOLOv8

YOLOv8 is the latest version of YOLO by Ultralytics. As a cutting-edge, state-of-the-art (SOTA) model, YOLOv8 builds on the success of previous versions, introducing new features and improvements for enhanced performance, flexibility, and efficiency. YOLOv8 supports a full range of vision AI tasks, including detection, segmentation, pose estimation, tracking, and classification. This versatility allows users to leverage YOLOv8's capabilities across diverse applications and domains.


Different Functionalities

Detection

Detection is the primary task supported by YOLOv8. It involves detecting objects in an image or video frame and drawing bounding boxes around them. The detected objects are classified into different categories based on their features. YOLOv8 can detect multiple objects in a single image or video frame with high accuracy and speed.

Segmentation

Segmentation involves partitioning an image into regions based on its content, with each region assigned a label. This task is useful in applications such as scene understanding and medical imaging. YOLOv8's segmentation models predict an instance mask for every detected object in addition to its bounding box.

Classification

Classification involves assigning an image to one of a set of categories. YOLOv8 can be used to classify images based on their content, predicting a single label and confidence score per image.

Pose

Pose/keypoint detection involves detecting specific points in an image or video frame. These points are referred to as keypoints and are used for movement tracking and pose estimation. YOLOv8 can detect keypoints in an image or video frame with high accuracy and speed.
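
To make the four tasks concrete, here is a hedged sketch showing how each task maps onto its own pretrained checkpoint in the Ultralytics API; the file names are the standard nano-size weights published by Ultralytics, and image.jpg is a hypothetical input:

In [ ]:
from ultralytics import YOLO

# Each YOLOv8 task has its own family of pretrained weights (nano size shown)
detector   = YOLO("yolov8n.pt")       # detection: boxes and class labels
segmenter  = YOLO("yolov8n-seg.pt")   # segmentation: boxes plus instance masks
classifier = YOLO("yolov8n-cls.pt")   # classification: one label per image
pose_model = YOLO("yolov8n-pose.pt")  # pose: person keypoints

# The same predict call works for every task; the results expose
# task-specific fields (boxes, masks, probs, keypoints)
results = segmenter("image.jpg")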

Training

Training a deep learning model involves feeding it data and adjusting its parameters so that it can make accurate predictions. Train mode in Ultralytics YOLOv8 is engineered for effective and efficient training of object detection models, fully utilizing modern hardware capabilities. This guide aims to cover all the details you need to get started with training your own models using YOLOv8's robust set of features. Notable features of YOLOv8's Train mode include:

  • Automatic Dataset Download: Standard datasets like COCO, VOC, and ImageNet are downloaded automatically on first use.
  • Multi-GPU Support: Scale your training efforts seamlessly across multiple GPUs to expedite the process.
  • Hyperparameter Configuration: The option to modify hyperparameters through YAML configuration files or CLI arguments.
  • Visualization and Monitoring: Real-time tracking of training metrics and visualization of the learning process for better insights.

A minimal training example follows the COCO section below.

COCO Training Set

The COCO (Common Objects in Context) dataset is a large-scale collection of images used for object detection, segmentation, and captioning. It features over 200,000 images with detailed annotations for various objects in diverse environments. COCO128 is a smaller subset of the COCO dataset, containing 128 images, as the name suggests. It retains the full dataset's diverse scenarios and rich annotations but is more manageable for quick testing, prototyping, and educational purposes due to its smaller size.
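
As a minimal training sketch (not part of this project's script), the following cell fine-tunes the nano detection model on COCO128; coco128.yaml is a dataset reference that ships with Ultralytics and triggers the automatic download described above:

In [ ]:
from ultralytics import YOLO

# Start from pretrained nano weights and fine-tune on COCO128
model = YOLO("yolov8n.pt")
model.train(data="coco128.yaml", epochs=3, imgsz=640)  # short run for quick testing

# Evaluate the fine-tuned weights on the validation split
metrics = model.val()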


Implementation

The integration of the RealSense depth camera and the YOLOv8 model is achieved mainly through a continuous frame-processing loop, though some preparation is necessary first. In this section, the code is broken into chunks to ease readability and provide commentary. The main loop is illustrated by the flowchart below.

Main loop flowchart

Prerequisites

The libraries used throughout the code are imported here.

In [ ]:
from ultralytics import YOLO   # YOLOv8 models
import cv2                     # image display and drawing
import random                  # random per-class colours
import numpy as np             # array operations
import pyrealsense2 as rs      # RealSense camera API
import math                    # Euclidean distance

Overlay Function

This function overlays a generated mask, in its corresponding colour, on top of the input image.

In [ ]:
def overlay(image, mask, color, alpha, resize=None):
    # Replicate the single-channel mask across the three colour channels
    colored_mask = np.expand_dims(mask, 0).repeat(3, axis=0)
    colored_mask = np.moveaxis(colored_mask, 0, -1)
    # Fill the masked pixels with the class colour
    masked = np.ma.MaskedArray(image, mask=colored_mask, fill_value=color)
    image_overlay = masked.filled()

    if resize is not None:
        image = cv2.resize(image.transpose(1, 2, 0), resize)
        image_overlay = cv2.resize(image_overlay.transpose(1, 2, 0), resize)

    # Blend the original image with the coloured overlay
    image_combined = cv2.addWeighted(image, 1 - alpha, image_overlay, alpha, 0)
    return image_combined
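
As a quick sanity check, the overlay function can be exercised with a synthetic mask. This hedged example (not part of the main script) blends a red rectangle onto a gray test image, reusing the cv2 and numpy imports from above:

In [ ]:
# Synthetic test: a gray image and a binary mask covering its centre
test_img = np.full((240, 320, 3), 128, dtype=np.uint8)
test_mask = np.zeros((240, 320), dtype=np.uint8)
test_mask[60:180, 80:240] = 1

# Blend the masked region in red (BGR order) at 40% opacity
blended = overlay(test_img, test_mask, color=(0, 0, 255), alpha=0.4)
cv2.imwrite("overlay_demo.png", blended)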

Plot Box Function

This function plots a rectangle around a detected object while displaying the classification category, probability, and measured distance.

In [3]:
def plot_one_box(x, img, color=None, label=None, line_thickness=3):
    # Line thickness scaled to image size if not given
    tl = line_thickness or round(0.002 * (img.shape[0] + img.shape[1]) / 2) + 1
    color = color or [random.randint(0, 255) for _ in range(3)]
    # Box corners: (xmin, ymin) and (xmax, ymax)
    c1, c2 = (int(x[0]), int(x[1])), (int(x[2]), int(x[3]))
    cv2.rectangle(img, c1, c2, color, thickness=tl, lineType=cv2.LINE_AA)
    if label:
        tf = max(tl - 1, 1)  # font thickness
        t_size = cv2.getTextSize(label, 0, fontScale=tl / 3, thickness=tf)[0]
        c2 = c1[0] + t_size[0], c1[1] - t_size[1] - 3
        # Filled rectangle behind the label text for readability
        cv2.rectangle(img, c1, c2, color, -1, cv2.LINE_AA)
        cv2.putText(img, label, (c1[0], c1[1] - 2), 0, tl / 3, [225, 255, 255], thickness=tf, lineType=cv2.LINE_AA)
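
A hedged usage example (not part of the main loop): drawing one hypothetical detection, labelled the same way the main loop does, on a synthetic image:

In [ ]:
# Draw a made-up detection box with a class/confidence/distance label
demo = np.full((240, 320, 3), 128, dtype=np.uint8)
plot_one_box([80, 60, 240, 180], demo, color=(0, 255, 0),
             label='person 0.92 85.3')  # class, confidence, distance (cm)
cv2.imwrite("plot_box_demo.png", demo)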

Camera Feed Preparation

To access the color and depth images of the RealSense camera, a pipeline is defined, configured, and started. An alignment object is also created here so that the depth and color frames share a common viewpoint, making the positional data extracted from each frame interchangeable.

In [ ]:
pipeline = rs.pipeline()
config = rs.config()
# Request matching 1280x720 depth (z16) and color (bgr8) streams at 30 FPS
config.enable_stream(rs.stream.depth, 1280, 720, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 1280, 720, rs.format.bgr8, 30)
# Align color frames to the depth stream's viewpoint
align = rs.align(rs.stream.depth)

print("[INFO] Starting streaming...")
pipeline.start(config)
print("[INFO] Camera ready.")

Loading Model

Here, a model instance is created and the nano YOLOv8 segmentation weights are loaded. The list of detectable classes is extracted and printed to the console, and a random colour is assigned to each class.

In [ ]:
model = YOLO("yolov8n-seg.pt")  # nano YOLOv8 segmentation weights
class_names = model.names       # mapping of class id to class name
print('Class Names: ', class_names)
# One random BGR colour per class, used for masks and boxes
colors = [[random.randint(0, 255) for _ in range(3)] for _ in class_names]

Main Loop

With the necessary preparations made, an infinite loop can be constructed. Within each iteration, a color and depth frame pair is processed and displayed. The cells below show the body of that while True: loop, split into chunks; statements such as continue and break are valid only inside it.

In [ ]:
# Inside the while True loop: wait for a synchronized frame pair
frames = pipeline.wait_for_frames()
aligned_frames = align.process(frames)
depth_frame = aligned_frames.get_depth_frame()
aligned_color_frame = aligned_frames.get_color_frame()
color_frame = frames.get_color_frame()
# Skip this iteration if either frame is missing
if not depth_frame or not aligned_color_frame: continue
In [ ]:
# Camera intrinsics, needed to deproject pixels into 3D points
color_intrin = aligned_color_frame.profile.as_video_stream_profile().intrinsics
depth_image = np.asanyarray(depth_frame.get_data())
color_image = np.asanyarray(color_frame.get_data())
h, w, _ = color_image.shape
# Run YOLOv8 segmentation on the color frame
results = model.predict(color_image, stream=True)
In [ ]:
for r in results:
    boxes = r.boxes  # Boxes object for bbox outputs
    masks = r.masks  # Masks object for segmentation mask outputs
    probs = r.probs  # Class probabilities for classification outputs
In [ ]:
if masks is not None:
    # Move the mask tensors to the CPU as numpy arrays
    masks = masks.data.cpu().numpy()
    for seg, box in zip(masks, boxes):
        # Upscale the mask to the full frame size and overlay it
        seg = cv2.resize(seg, (w, h))
        color_image = overlay(color_image, seg, colors[int(box.cls)], 0.4)

        # Centroid of the mask; np.argwhere returns (row, col), i.e. (y, x)
        count = (seg == 1).sum()
        x, y = np.argwhere(seg == 1).sum(0) / count
        # get_distance and deprojection expect (col, row), hence the swapped order
        depth = depth_frame.get_distance(int(y), int(x))
        dx, dy, dz = rs.rs2_deproject_pixel_to_point(color_intrin, [y, x], depth)
        distance = math.sqrt(dx ** 2 + dy ** 2 + dz ** 2)

        # Bounding-box corners
        xmin = int(box.data[0][0])
        ymin = int(box.data[0][1])
        xmax = int(box.data[0][2])
        ymax = int(box.data[0][3])

        # Label: class name, confidence, and distance in centimetres
        plot_one_box([xmin, ymin, xmax, ymax], color_image, colors[int(box.cls)],
                     f'{class_names[int(box.cls)]} {float(box.conf):.3} {float(100 * distance):.3}')
In [ ]:
cv2.imshow('img', color_image)
# Press 'q' to exit the loop
if cv2.waitKey(1) & 0xFF == ord('q'):
    break

Clean-up and Resource Management

To free up memory and disengage the camera, the pipeline needs to be stopped at the end of the operation; the OpenCV preview window is closed as well.

In [ ]:
print("[INFO] stop streaming ...")
pipeline.stop()

Live Demo

Sources

  1. Technical information on RealSense Depth Cameras and the image: intelrealsense.com
  2. Comparison between Faster RCNN, YOLO, and SSD: medium.com
  3. YOLOv8 documentation and brand information: ultralytics.com
  4. COCO documentation: cocodataset.org/
  5. Graphics: freepik.com

Thank you