https://github.com/jiucai233/DSL13thEnterpriseProject

Project summary

<aside>

This project implements a YOLOv8 object detection model for identifying objects in video and image data.

Use case: for a food delivery company, a picture of the food taken right before the lid closes is helpful for customer service. The model is therefore fine-tuned to capture that food picture and video.

</aside>

Project background

<aside>

This project was conducted at the Data Science Lab @Yonsei University. Two teams worked on the same topic, and our team had 6 members.

The topic came from @BlitzDynamics, a company from South Korea. Our team aimed to develop an AI model (object detection) for the task. They asked us to find a lightweight model that doesn’t need a lot of computing resources.

</aside>

My part

<aside>

I was first in charge of testing the YOLOv8n model for the task. Luckily, this model achieved the best performance among the models.

Data collection and preprocessing:

Our source is video from a monitoring camera mounted above the kitchen, which observes the whole food preparation process.

I spent 6 hours labeling data, dividing the bounding boxes into 4 categories:

closed delivery box
delivery box
hand
food

Data labeling:

Screenshot from the video:

Model Training:

Training details (the actual code can be found in the GitHub repo):

The whole training run was conducted on a local CPU and took 3 hours. The resulting mAP50-95 is decent.

Algorithm of the project:

Once the model was trained, it fortunately could tell whether the lid is closed or open, so we decided on the following approach:

find the moment the label changes (closed box → box) and collect the 10 frames before and after it.
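
This first approach can be sketched in a few lines of plain Python. It is only an illustration, not the code from the repo: the helper names `find_label_changes` and `fixed_window`, and the idea of reducing the model output to one label string per frame, are assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def find_label_changes(
    labels: Sequence[str], from_label: str, to_label: str
) -> List[int]:
    """Return every index i where labels[i - 1] == from_label
    and labels[i] == to_label."""
    return [
        i
        for i in range(1, len(labels))
        if labels[i - 1] == from_label and labels[i] == to_label
    ]


def fixed_window(idx: int, n_frames: int, margin: int = 10) -> Tuple[int, int]:
    """Inclusive frame range covering `margin` frames on each side of idx,
    clamped to the valid range [0, n_frames - 1]."""
    return max(0, idx - margin), min(n_frames - 1, idx + margin)


# Hypothetical per-frame labels for a 40-frame video:
# the label switches at frame 25.
labels = ["closed delivery box"] * 25 + ["delivery box"] * 15
changes = find_label_changes(labels, "closed delivery box", "delivery box")
windows = [fixed_window(i, len(labels)) for i in changes]
```

Because every single label change produces a window, one mislabeled frame is enough to trigger a clip.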

But there were two problems:

the model sometimes predicts the closed box label for a short period, which means the wrong video gets collected
the clip length is fixed at 10 frames, whatever the frame rate of the video

So the solutions are:

set a minimum duration and confidence level for the label, e.g. the “closed delivery box” label must have a confidence over 0.75 and keep appearing for over 10 frames.

read the fps of the video and use it as a parameter in the code:

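    import math
    import os

    import cv2

    # Convert the time window from seconds to frames using the video's own fps.
    pre_frames = max(1, math.floor(fps * pre_sec))
    post_frames = max(1, math.floor(fps * post_sec))
    ...
    # Clamp the clip window to the bounds of the buffered frames.
    start = max(0, idx - pre_frames)
    end = min(len(frames) - 1, idx + post_frames)
    raw_video_path = os.path.join(raw_dir, 'clip_raw.mp4')
    writer = cv2.VideoWriter(raw_video_path, fourcc, fps, (w, h))
    # Write frames start..end (inclusive) into the clip.
    for f in frames[start:end+1]:
        writer.write(f)
    writer.release()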

Here, fps is detected by the code, and pre_sec is a parameter that the user can change.
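
The confidence-and-duration rule described above can be sketched in plain Python. This is a minimal illustration, not the code from the repo: the function name `find_stable_label_start`, the per-frame `(label, confidence)` input format, and reading "over 10 frames" as "at least `min_frames` consecutive frames" are all assumptions made for the example.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical per-frame detection: (label, confidence) of the box of interest,
# or None when nothing was detected in that frame.
Detection = Optional[Tuple[str, float]]


def find_stable_label_start(
    detections: Sequence[Detection],
    target_label: str = "closed delivery box",
    min_conf: float = 0.75,
    min_frames: int = 10,
) -> Optional[int]:
    """Return the first index of the first run in which target_label is
    detected with confidence above min_conf for at least min_frames
    consecutive frames, or None if no such run exists."""
    run_start = None  # index where the current qualifying run began
    for i, det in enumerate(detections):
        qualifies = (
            det is not None and det[0] == target_label and det[1] > min_conf
        )
        if qualifies:
            if run_start is None:
                run_start = i
            if i - run_start + 1 >= min_frames:
                return run_start
        else:
            # A missing, different, or low-confidence detection resets the run,
            # which filters out short flickers of the label.
            run_start = None
    return None


# A 3-frame flicker is ignored; the later 10-frame run is accepted.
open_box = ("delivery box", 0.90)
closed_box = ("closed delivery box", 0.85)
sequence = [open_box] * 5 + [closed_box] * 3 + [open_box] * 4 + [closed_box] * 10
event_idx = find_stable_label_start(sequence)
```

The returned index could then play the role of `idx` in the clip-writing code above, so that a clip is cut only around a label that has stayed stable.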

Follow-up work

I created the GitHub repo, attended the company meeting, and presented the test example.

Result example

Result of id7 food with boxes

picture without boxes

result for food with boxes

</aside>

My comment

<aside>

This was my first time experiencing object detection and vision model fine-tuning, and data labeling is so tiring. But we got a pretty good result. As my last scheduled project in DSL, it’s a perfect farewell.

Big thanks to all my teammates.

</aside>

DSL_Corp_CV.pptx