Warmup
Detecting people and objects in a scene sequence is really tough. With AI, we have an opportunity to speed up this kind of work.
Below is a scene from the movie "George", directed by David Coudyser, which I co-produced with David:
In this scene, we need to outline each actor in every frame. Doing that manually to get this result would be painful:
Most tools and software already feature this kind of selector. Most of them use complex algorithms to perform it.
Here, we don't have to do that: we just need to "describe" what we want to track, like an "actor" or a "boom".
Using AI in this context could be very helpful.
For a prototype, we will use OpenCV (for video and frame management) and, in particular, the YOLO model from Ultralytics. With it, we can outline each described element (e.g. objects, people, animals) as follows. Here we track a "person":
This is a sample output from the program below. It is not perfect - we still need to tune some values and parameters (confidence, new_track_thresh, etc.) - but it is a good first step for the next stages.
The source code
First of all, we need to install the opencv and ultralytics libraries:
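```bash
# opencv-python and ultralytics are the PyPI package names
pip install opencv-python ultralytics
```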
And this is the source code, using OpenCV for the video/frame part and Ultralytics / YOLO for the AI part:
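(The listing here is a minimal sketch of that program; the file names scene.mp4, scene_tracked.mp4 and tracker.yml, as well as the drawing parameters, are assumptions, not the exact production values.)

```python
import cv2
import numpy as np
from ultralytics import YOLO

# Load the segmentation model (downloaded automatically on first use)
model = YOLO("yolov8n-seg.pt")

# Open the input video and prepare the output writer with the same properties
cap = cv2.VideoCapture("scene.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter("scene_tracked.mp4",
                      cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height))

while True:
    # Read the next frame; stop at the end of the video
    ret, frame = cap.read()
    if not ret:
        break

    # Feed the frame into the AI model and keep tracking IDs between frames
    results = model.track(frame, persist=True, tracker="tracker.yml")

    # Draw a polygon around each detected "object"
    if results[0].masks is not None:
        for polygon in results[0].masks.xy:
            points = polygon.astype(np.int32).reshape((-1, 1, 2))
            cv2.polylines(frame, [points], isClosed=True,
                          color=(0, 255, 0), thickness=2)

    # Write the annotated frame to the output video
    out.write(frame)

cap.release()
out.release()
```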
The code follows these steps:
- Read the input video file
- Read each frame
- Feed them into the AI model
- Get the results which are all detected "objects"
- Draw a polygon around each "object"
- Finally, write the output video file including the polygons
Let's take a look at part of the source code!
We will study some important parts of this source code, starting with YOLO's model.track function:
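```python
# The two YOLO-related lines from the sketch above
model = YOLO("yolov8n-seg.pt")
results = model.track(frame, persist=True, tracker="tracker.yml")
```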
YOLO is the main class: it loads our model and returns a handler (here, model). You can call model(content) directly, but you won't get tracking information - only detection data.
Don't worry about the model file yolov8n-seg.pt: the YOLO library will automatically download the selected model. Note that it's important to use a 'seg' model. The file size of this model is only a few megabytes.
For our use case, we use the track function (shown here with the parameters assumed in the sketch above):
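```python
results = model.track(
    frame,                  # source of the data: the current frame buffer
    persist=True,           # keep tracking IDs from one frame to the next
    tracker="tracker.yml",  # optional tracker configuration (see below)
)
```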
frame in track is the source of the data: our frame data. Here, it holds the buffer of the frame previously fetched by the OpenCV read function:
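```python
ret, frame = cap.read()  # frame now contains the raw image buffer of this frame
```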
The track function returns a complete data structure that contains the boxes and masks arrays (and many other values and methods).
Normally, results stores one entry per frame. Here, OpenCV reads a single frame and feeds it to the track function, so results contains only one entry.
Everything we need is in the arrays inside results, such as .boxes and .masks:
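```python
boxes = results[0].boxes   # IDs, classes, confidence scores, coordinates
masks = results[0].masks   # segmentation masks (the polygons we draw)
```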
Each array follows one rule: indexes are linked across arrays (the values below are illustrative and match the table that follows):
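```python
results[0].boxes.id    # tensor([1., 2.])          -> object IDs
results[0].boxes.cls   # tensor([0., 0.])          -> object types (0 = "person")
results[0].boxes.conf  # tensor([0.8144, 0.8274])  -> confidence scores
```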
Which we can translate as:
| Object ID | Object Type | Confidence |
|---|---|---|
| 1 | 0 (person) | 0.8144 |
| 2 | 0 (person) | 0.8274 |
So, if we need to work with each detected "object", we can do something like this (a sketch using the same results object as above):
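```python
boxes = results[0].boxes
for object_id, object_cls, confidence in zip(
    boxes.id.int().tolist(),   # tracking ID of the object
    boxes.cls.int().tolist(),  # class index (0 = "person")
    boxes.conf.tolist(),       # confidence score
):
    print(f"id={object_id} class={object_cls} confidence={confidence:.4f}")
```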
With this code, the output will contain everything we need; for the two detections above, it would look like this:
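```
id=1 class=0 confidence=0.8144
id=2 class=0 confidence=0.8274
```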
tracker.yml
The configuration file isn't mandatory with YOLO. We use it only to optimize some parts of the detection and processing. Parameters and descriptions are available here.
This configuration is based on the original configuration file available here.
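As an illustration, here is a sketch of what such a file can contain. The parameter names come from the Ultralytics BoT-SORT tracker configuration; the values below are only a starting point to tune:

```yaml
tracker_type: botsort    # tracker algorithm: botsort or bytetrack
track_high_thresh: 0.5   # threshold for the first association
track_low_thresh: 0.1    # threshold for the second association
new_track_thresh: 0.6    # threshold to start a new track when no match is found
track_buffer: 30         # number of frames to keep lost tracks alive
match_thresh: 0.8        # matching threshold
```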
Using YOLO without OpenCV?
Yes, that's possible: remember, the track function requires a source. In the previous example, our source is the frame content, but the source can also be a video directly. In this case, the variable results stores one entry per frame of the video (a minimal sketch, with the same assumed file names as before):
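```python
from ultralytics import YOLO

model = YOLO("yolov8n-seg.pt")

# The source is now the video file itself: YOLO reads every frame for us
# ("scene.mp4" and "tracker.yml" are the same assumed names as before)
results = model.track("scene.mp4", tracker="tracker.yml")

for frame_index, result in enumerate(results):
    # result.orig_img is the raw frame; .boxes and .masks follow the same indexing rule
    print(frame_index, result.orig_img.shape, result.boxes.id, result.boxes.conf)
```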
The output of this program will look like this (illustrative values; the resolution shown is only an example):
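```
0 (1080, 1920, 3) tensor([1., 2.]) tensor([0.8144, 0.8274])
...
```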
As you can see, we have full access to the frame data via the orig_img variable, as well as to the .boxes and .masks variables. Therefore, the "rule of indexing" is the same as the one seen in the previous section.
So, you have two different methods to do the same thing:
- Read the frames via OpenCV, process the tracking via YOLO, and then draw the polygons via OpenCV,
- Read the frame via YOLO, process the tracking via YOLO, and draw the polygons using another library (or your own method)
- (plot twist with the 3rd method) Read the frame using another video management library (such as ffmpeg), process the tracking via YOLO, etc.
Conclusion
As I said, this code isn't perfect (and yes, it's even a bit ugly). As you probably saw in the sample output video, some people are not detected in a few frames. It's mainly a prototype, a first draft. Therefore, it's normal to have this kind of issue at this stage.
To improve the processing, we could investigate technical options such as:
- First of all, we use the v8 YOLO model, but Ultralytics has released newer models; we should probably use a more recent version;
- Second, we need to adjust some parameters to improve the detection (see the tracker configuration).
- And finally, you can also "train" (fine-tune) one of the YOLO models to boost its accuracy (a minimal sketch follows this list).
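As a rough illustration of that last point (the dataset file my-dataset.yaml is hypothetical and must follow the Ultralytics dataset format):

```python
from ultralytics import YOLO

# Start from the pre-trained segmentation weights and fine-tune them on our own data
model = YOLO("yolov8n-seg.pt")
model.train(data="my-dataset.yaml", epochs=50, imgsz=640)
```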
Using YOLO here is just an example. Many other models exist, work well too, and are available on the Hugging Face Hub:
In its current state, with a few modifications, the code above could be used as a first step in organizing your daily rushes (dailies). For that use case, we would remove some unnecessary code, such as the masks and the video-writing part, and only keep the name (class) of each detected "object" for classification.
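For example, a hypothetical sketch that only collects the detected class names for one rush (rush_001.mp4 is an assumed file name):

```python
from ultralytics import YOLO

model = YOLO("yolov8n-seg.pt")
results = model.track("rush_001.mp4", tracker="tracker.yml")

# Collect the names of every class detected across the whole rush
detected = {model.names[int(cls)] for result in results for cls in result.boxes.cls}
print(detected)  # e.g. {'person'}
```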
Footnotes
- For archives: app_with_opencv.py, app_without_opencv.py, app.yolo.tracker.yml, yolov8n-seg.pt
- The Ultralytics website contains many new models you can use. Here, we use only the v8.
- The track() function documentation, needed to track each object with an ID.
- The tracker file documentation, with all the information about each parameter available to optimize detection and processing.
- The GitHub repository for the Ultralytics source code.
- If necessary, this is the direct link to the yolov8n-seg.pt model file published by Ultralytics and stored in the GitHub repository.
- Hugging Face Hub: Image segmentation models, Object detection models, Keypoint detection models
- And some now use AI models as well.