Real-Time Object Detection with YOLO and RealSense Depth Camera
Introduction
This project implements real-time object detection and segmentation using the YOLO (You Only Look Once) model, integrated with an Intel RealSense depth camera. The script captures video frames from the camera, runs object detection, overlays segmentation masks, estimates the distance to each detected object, and displays the results in real time.
Object Detection
Object detection is a computer vision technique that identifies and locates objects within an image or video frame. This process involves both classifying objects and determining their positions, typically marked with bounding boxes.
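To make the two outputs (class and position) concrete, here is a minimal sketch of how a single detection could be represented and filtered by confidence. The `Detection` class, its field names, and the 0.5 threshold are illustrative assumptions, not part of this project's code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """One detected object: what it is and where it is (hypothetical structure)."""
    label: str                      # predicted class name
    confidence: float               # classification score in [0, 1]
    box: Tuple[int, int, int, int]  # (xmin, ymin, xmax, ymax) in pixels


def keep_confident(detections: List[Detection], threshold: float = 0.5) -> List[Detection]:
    """Drop detections whose score falls below the threshold."""
    return [d for d in detections if d.confidence >= threshold]


detections = [
    Detection("person", 0.91, (10, 20, 110, 220)),
    Detection("chair", 0.32, (0, 0, 50, 50)),
]
confident = keep_confident(detections)  # only the "person" detection remains
```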
RealSense
RealSense is a family of depth-sensing cameras developed by Intel. The cameras capture 3D spatial data and provide depth perception for applications ranging from virtual and augmented reality to robotics and gesture recognition.
Industrial Use Cases
Machine Learning Models
Machine learning object detection models identify and locate objects in images or videos. They use algorithms, typically based on convolutional neural networks, to classify objects and pinpoint their positions with bounding boxes. Popular models like YOLO, SSD, and Faster R-CNN vary in speed and accuracy and are essential in applications like autonomous driving, surveillance, and augmented reality.
Model Comparison
Faster R-CNN, YOLO, and SSD are three popular object detection systems that use deep learning to locate and classify objects in images. They differ in architecture, speed, and accuracy.
In summary, Faster R-CNN is suitable for applications that require high accuracy and can tolerate lower speed, such as medical image analysis or autonomous driving. YOLO is suitable for applications that require real-time performance and can tolerate some errors, such as video surveillance or sports analysis. SSD is a good compromise between speed and accuracy and can be used for general-purpose object detection tasks.
YOLOv8
YOLOv8 is a version of YOLO developed by Ultralytics and was the latest release at the time of writing. It builds on earlier versions with improvements in performance, flexibility, and efficiency, and supports a range of vision tasks: detection, segmentation, pose estimation, tracking, and classification. This project uses its segmentation capability.
Different functionalities

Training

COCO training set
The COCO (Common Objects in Context) dataset is a large-scale collection of images used for object detection, segmentation, and captioning. It features over 200,000 images with detailed annotations for various objects in diverse environments. COCO128 is a small subset of COCO containing 128 images. It keeps the same annotation format and a variety of scenes, and its small size makes it convenient for quick testing, prototyping, and teaching.
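The source does not show a training command, so the following is only a sketch of how a short fine-tuning run on COCO128 might look with the Ultralytics Python API. The settings in `TRAIN_ARGS` and the helper name `train_on_coco128` are assumptions chosen for a quick smoke test; the dataset is downloaded on first use.

```python
# Hypothetical settings for a quick smoke test, not the project's actual values.
TRAIN_ARGS = {
    "data": "coco128-seg.yaml",  # segmentation variant of COCO128 shipped with Ultralytics
    "epochs": 3,
    "imgsz": 640,
}


def train_on_coco128(weights="yolov8n-seg.pt", **overrides):
    """Fine-tune a YOLOv8 segmentation model on COCO128 and return the training result."""
    # Imported lazily so the settings above can be inspected without the package installed.
    from ultralytics import YOLO

    model = YOLO(weights)
    return model.train(**{**TRAIN_ARGS, **overrides})
```

Calling `train_on_coco128(epochs=1)` would start a one-epoch run; the script in this project skips training and loads pretrained weights directly.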
Implementation
The RealSense depth camera and YOLOv8 are integrated through a continuous frame-processing loop, which requires some preparation first. In this section, the code is broken into chunks for readability and commentary. Each iteration of the main loop captures a frame pair, aligns it, runs the model, overlays the masks, measures the distance to each object, and displays the result.
Prerequisites
The libraries used throughout the code are imported here.
from ultralytics import YOLO
import cv2
import random
import numpy as np
import pyrealsense2 as rs
import math
Overlay Function
This function is in charge of overlaying a generated mask with its corresponding colour on top of the input image.
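def overlay(image, mask, color, alpha, resize=None):
    """Blend a single-colour mask onto an HxWx3 image and return the result."""
    # Expand the HxW mask to HxWx3 so it matches the image channels.
    colored_mask = np.expand_dims(mask, 0).repeat(3, axis=0)
    colored_mask = np.moveaxis(colored_mask, 0, -1)
    # Replace the masked pixels with the class colour.
    masked = np.ma.MaskedArray(image, mask=colored_mask, fill_value=color)
    image_overlay = masked.filled()
    if resize is not None:
        # This branch expects channel-first (CxHxW) input; the main loop does not use it.
        image = cv2.resize(image.transpose(1, 2, 0), resize)
        image_overlay = cv2.resize(image_overlay.transpose(1, 2, 0), resize)
    # Alpha-blend the original image with the coloured overlay.
    image_combined = cv2.addWeighted(image, 1 - alpha, image_overlay, alpha, 0)
    return image_combined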
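The final blending step is a per-channel weighted sum. As a rough illustration in plain Python, the sketch below applies the same formula to one pixel; `blend_pixel` is a hypothetical helper written for this explanation, not an OpenCV function.

```python
def blend_pixel(image_px, overlay_px, alpha):
    """Blend one BGR pixel: (1 - alpha) * image + alpha * overlay, per channel."""
    return tuple(
        min(255, max(0, round((1 - alpha) * i + alpha * o)))
        for i, o in zip(image_px, overlay_px)
    )


grey = (100, 100, 100)  # original pixel (B, G, R)
red = (0, 0, 255)       # mask colour (B, G, R)
blended = blend_pixel(grey, red, 0.4)
print(blended)  # (60, 60, 162)
```

With `alpha = 0.4`, as used in the main loop, the original image still contributes 60% of each channel, so the object stays visible under its mask.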
Plot Box Function
This function plots a rectangle around a detected object while displaying the classification category, probability, and measured distance.
def plot_one_box(x, img, color=None, label=None, line_thickness=3):
    """Draw a bounding box and an optional label on img (modified in place)."""
    # Use the given thickness, or scale it with the image size if it is 0/None.
    tl = line_thickness or round(0.002 * (img.shape[0] + img.shape[1]) / 2) + 1
    color = color or [random.randint(0, 255) for _ in range(3)]
    c1, c2 = (int(x[0]), int(x[1])), (int(x[2]), int(x[3]))
    cv2.rectangle(img, c1, c2, color, thickness=tl, lineType=cv2.LINE_AA)
    if label:
        tf = max(tl - 1, 1)  # font thickness
        t_size = cv2.getTextSize(label, 0, fontScale=tl / 3, thickness=tf)[0]
        # Filled background for the label, placed above the top-left corner.
        c2 = c1[0] + t_size[0], c1[1] - t_size[1] - 3
        cv2.rectangle(img, c1, c2, color, -1, cv2.LINE_AA)
        cv2.putText(img, label, (c1[0], c1[1] - 2), 0, tl / 3, [225, 255, 255], thickness=tf, lineType=cv2.LINE_AA)
Camera Feed Preparation
To access the color and depth images of the RealSense camera, a pipeline is defined, configured, and started. An alignment object is also created here: it maps the color stream onto the depth stream so that a pixel position refers to the same point in both images.
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 1280, 720, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 1280, 720, rs.format.bgr8, 30)
align = rs.align(rs.stream.depth)
print("[INFO] Starting streaming...")
pipeline.start(config)
print("[INFO] Camera ready.")
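The depth stream is configured as `z16`, meaning each pixel is a 16-bit integer in device-specific units rather than metres. The sketch below shows the conversion under the assumption of a depth scale of 0.001 (1 mm per unit); the real value should be read from the device, and `raw_depth_to_meters` is a hypothetical helper. The main loop avoids this step by calling `get_distance`, which already returns metres.

```python
def raw_depth_to_meters(raw_value, depth_scale=0.001):
    """Convert a raw 16-bit depth reading to metres.

    depth_scale is an assumed value here. A raw value of 0 means the camera
    has no depth data for that pixel, so the result is 0.0 as well.
    """
    return raw_value * depth_scale


print(raw_depth_to_meters(1500))  # roughly 1.5 m under the assumed scale
```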
Loading Model
Here, a model instance is created from the pretrained nano YOLOv8 segmentation weights. The names of the detectable classes are extracted and printed to the console, and a random colour is assigned to each class.
model = YOLO("yolov8n-seg.pt")
class_names = model.names
print('Class Names: ', class_names)
colors = [[random.randint(0, 255) for _ in range(3)] for _ in class_names]
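Because the colours are drawn from the global random generator, they change on every run. If stable colours are preferred, one option is a seeded private generator, as in the sketch below. The helper name `make_class_colors`, the seed, and the three-class `names` dictionary (a small stand-in for `model.names`) are assumptions for illustration.

```python
import random


def make_class_colors(class_names, seed=0):
    """Return one random BGR colour per class id, identical across runs."""
    rng = random.Random(seed)  # private generator; the global random state is untouched
    return {cls_id: [rng.randint(0, 255) for _ in range(3)] for cls_id in class_names}


names = {0: "person", 1: "bicycle", 2: "car"}  # stand-in for model.names
palette = make_class_colors(names)
```

The result is keyed by class id, so a lookup such as `palette[int(box.cls)]` works the same way as indexing the list used above.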
Main Loop
With the necessary preparations made, an infinite loop can be constructed. Within each iteration, a color and depth frame pair is processed and displayed.
while True:
    # Wait for a frame set and align the color frame to the depth frame.
    frames = pipeline.wait_for_frames()
    aligned_frames = align.process(frames)
    depth_frame = aligned_frames.get_depth_frame()
    aligned_color_frame = aligned_frames.get_color_frame()
    if not depth_frame or not aligned_color_frame:
        continue

    color_intrin = aligned_color_frame.profile.as_video_stream_profile().intrinsics
    # Use the aligned color frame so pixel coordinates match the depth frame.
    color_image = np.asanyarray(aligned_color_frame.get_data())
    h, w, _ = color_image.shape

    results = model.predict(color_image, stream=True)
    for r in results:
        boxes = r.boxes  # bounding boxes, classes, and confidences
        masks = r.masks  # segmentation masks (None if nothing was detected)
        if masks is None:
            continue
        for seg, box in zip(masks.data.cpu().numpy(), boxes):
            cls_id = int(box.cls)
            seg = cv2.resize(seg, (w, h))
            color_image = overlay(color_image, seg, colors[cls_id], 0.4)

            # Centroid of the mask; np.nonzero returns (rows, columns).
            ys, xs = np.nonzero(seg == 1)
            if len(xs) == 0:
                continue  # no full-valued mask pixels left after resizing
            cx, cy = float(xs.mean()), float(ys.mean())

            # Depth at the centroid in metres, then the 3D point and its range.
            depth = depth_frame.get_distance(int(cx), int(cy))
            dx, dy, dz = rs.rs2_deproject_pixel_to_point(color_intrin, [cx, cy], depth)
            distance = math.sqrt(dx ** 2 + dy ** 2 + dz ** 2)

            xmin, ymin, xmax, ymax = (int(v) for v in box.xyxy[0])
            # Fixed-point formatting keeps long distances out of scientific notation.
            label = f'{class_names[cls_id]} {float(box.conf):.2f} {100 * distance:.1f} cm'
            plot_one_box([xmin, ymin, xmax, ymax], color_image, colors[cls_id], label)

    cv2.imshow('img', color_image)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break  # press "q" in the display window to exit the loop
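The distance measurement combines two steps: finding the centroid of a mask and back-projecting that pixel into 3D using the measured depth. The sketch below reproduces both in plain Python under a simple pinhole model without lens distortion. The helper names and the intrinsics (`FX`, `FY`, `PPX`, `PPY`) are hypothetical; in the script, the intrinsics come from the camera and the back-projection is done by `rs.rs2_deproject_pixel_to_point`.

```python
import math


def mask_centroid(mask):
    """Return the (u, v) = (column, row) centroid of the non-zero pixels, or None if empty."""
    row_sum = col_sum = count = 0
    for v, row in enumerate(mask):
        for u, value in enumerate(row):
            if value:
                row_sum += v
                col_sum += u
                count += 1
    if count == 0:
        return None
    return col_sum / count, row_sum / count


def deproject(u, v, depth, fx, fy, ppx, ppy):
    """Pinhole back-projection of pixel (u, v) at the given depth, ignoring distortion."""
    x = (u - ppx) / fx * depth
    y = (v - ppy) / fy * depth
    return x, y, depth


# Hypothetical intrinsics for illustration; real values come from the camera.
FX = FY = 600.0
PPX, PPY = 640.0, 360.0

mask = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0]]
centroid = mask_centroid(mask)  # (1.5, 1.5)

# A pixel 300 px to the right of the principal point, measured 2 m deep.
point = deproject(940, 360, 2.0, FX, FY, PPX, PPY)  # (1.0, 0.0, 2.0)
distance = math.sqrt(sum(c * c for c in point))     # straight-line range in metres
```

The range is always at least as large as the depth value, because depth is measured along the optical axis while the range is the straight-line distance from the camera to the point.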
Clean-up and Resource Management
To free up memory and release the camera, the pipeline is stopped once the loop exits. The display window is closed as well.
print("[INFO] Stopping streaming...")
pipeline.stop()
cv2.destroyAllWindows()
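If the loop ends with an exception rather than the "q" key, the clean-up lines above are never reached. One way to guard against that is a `try`/`finally` block, sketched here with a stand-in class so it runs without a camera. `FakePipeline`, `run`, and `failing_step` are hypothetical names for this illustration only.

```python
class FakePipeline:
    """Stand-in for rs.pipeline so the pattern can run without hardware (hypothetical)."""

    def __init__(self):
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False


def run(pipeline, process_frames):
    """Start the pipeline, run the processing step, and always stop the pipeline."""
    pipeline.start()
    try:
        process_frames()
    finally:
        pipeline.stop()  # executed on normal exit and on exceptions alike


def failing_step():
    raise RuntimeError("simulated camera fault")


pipe = FakePipeline()
try:
    run(pipe, failing_step)
except RuntimeError:
    pass
print(pipe.running)  # False
```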
Live Demo
Sources
- Technical information on RealSense Depth Cameras and the image: intelrealsense.com
- Comparison between Faster RCNN, YOLO, and SSD: medium.com
- YOLOv8 documentation and brand information: ultralytics.com
- COCO documentation: cocodataset.org
- Graphics: freepik.com
