Owlet-Safety: Lightweight Model for Video Safety Monitoring
Introduction
Proactive safety isn’t optional—it’s essential. In environments where lives, assets, and critical operations are on the line, the ability to detect safety incidents in real time can make the difference between escalation and intervention. From fires and falls to theft, assault, or SOS gestures, early recognition of such events enables timely response and better outcomes.
Imagine identifying a theft in a retail aisle, detecting smoke in a factory, recognizing an SOS signal in a parking lot, or intervening in an assault in a public space—as it happens. Real-time video understanding makes these interventions possible.
To accelerate this capability, we’re introducing owlet-safety-3b-1—a fine-tuned, purpose-built variant of the open-source Qwen2.5-VL-3B-Instruct model. Owlet is trained for multi-label safety event detection in video, with a strong focus on high-impact incidents such as theft, assault, and SOS triggers. It also detects other hazardous events like fire, smoke, and falls, enabling comprehensive real-time situational awareness and unlocking scalable, automated safety intelligence from video streams.
Balancing Accuracy, Speed and Deployment Flexibility
Building vision-language models for real-world safety monitoring—where incidents like fire, falls, theft, or assault must be detected from video input—requires a thoughtful balance between model performance, operational control, and infrastructure efficiency.
Commercial multi-modal APIs and hosted foundation models offer zero-shot capabilities and fast integration. But in safety-critical domains, they come with significant limitations:
- High inference costs that scale with token length, usage volume, and model size
- Limited ability to customize or fine-tune for specific safety scenarios
- Data privacy concerns, especially when processing sensitive surveillance footage
- Cloud dependency, which restricts offline or edge deployment options
To address these constraints, small open-source vision-language models offer a compelling alternative—enabling full control over the inference pipeline and flexible deployment across cloud, on-premises, or edge environments. Among the models we evaluated, Qwen2.5-VL-3B-Instruct emerged as the most suitable choice, delivering strong performance on our multi-label safety event detection benchmarks while maintaining a lightweight footprint ideal for deployment on modest GPU hardware. Its native support for multi-modal inputs (video, image, and text) and conversational prompting capabilities made it particularly well-suited for rapid development, domain-specific fine-tuning, and scalable production deployment.
Training Data
To fine-tune the model for safety event detection, we curated a dataset of short video clips, each tagged with one or more safety-related labels: assault, fall, fire, smoke, SOS, theft, and none (for safe or uneventful videos). Unlike standard single-label classification, safety events can overlap (for example, a video might show both fire and smoke), so this is a multi-label task in which a single clip can carry several labels at once. We split the dataset into training and test sets in an 80/20 ratio, ensuring each label was well represented for effective training and evaluation.
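For illustration, here is a minimal sketch of how such a multi-label clip collection could be organized and split 80/20. The file paths, record layout, random seed, and split logic are assumptions for the example, not the actual tooling behind Owlet.

```python
# Sketch of an 80/20 split over multi-label video records; paths and the record
# layout are illustrative, not Owlet's actual dataset format.
import json
import random
from collections import Counter

LABELS = ["assault", "fall", "fire", "smoke", "sos", "theft", "none"]

# Each record pairs a clip with one or more labels (multi-label).
records = [
    {"video": "clips/warehouse_012.mp4", "labels": ["fire", "smoke"]},
    {"video": "clips/retail_aisle_387.mp4", "labels": ["theft"]},
    {"video": "clips/lobby_051.mp4", "labels": ["none"]},
    # ... many more clips in practice
]

random.seed(42)
random.shuffle(records)
split = int(0.8 * len(records))
train, test = records[:split], records[split:]

# Check that every label is represented on both sides of the split.
for name, subset in [("train", train), ("test", test)]:
    counts = Counter(label for r in subset for label in r["labels"])
    print(name, {label: counts.get(label, 0) for label in LABELS})

with open("train.json", "w") as f:
    json.dump(train, f, indent=2)
with open("test.json", "w") as f:
    json.dump(test, f, indent=2)
```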
Fine-Tuning for Safety
To adapt Qwen for multi-label safety event detection, we employed parameter-efficient fine-tuning using LoRA (Low-Rank Adaptation). This technique allowed us to update only a small subset of the model’s parameters, dramatically reducing both computational overhead and training cost.
We used the LLaMA-Factory framework, which offers native support for Qwen models, LoRA integration, and multi-modal instruction formatting, making it easy to set up a fine-tuning pipeline tailored to video-based inputs. To manage GPU memory efficiently during training, we capped the per-input video pixel budget at 16,384 pixels (video_max_pixels = 16384), which in turn bounds the number of visual tokens generated per clip. This helped ensure stable training across batches of multi-frame video clips.
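As a rough sketch of what such a run could look like, the snippet below writes a LoRA training config and launches it with llamafactory-cli. The key names follow LLaMA-Factory's YAML schema, but apart from the base model id and the video_max_pixels value mentioned above, every hyperparameter here (rank, learning rate, epochs, dataset name, and so on) is an illustrative assumption rather than the configuration we actually used.

```python
# Hypothetical LLaMA-Factory LoRA run; all values except the base model id and
# video_max_pixels are placeholders, not Owlet's actual training settings.
import subprocess
import yaml  # requires PyYAML

config = {
    "model_name_or_path": "Qwen/Qwen2.5-VL-3B-Instruct",
    "stage": "sft",
    "do_train": True,
    "finetuning_type": "lora",
    "lora_rank": 8,                       # assumed rank
    "lora_target": "all",
    "dataset": "safety_clips",            # hypothetical entry in dataset_info.json
    "template": "qwen2_vl",               # LLaMA-Factory chat template for Qwen2-VL / Qwen2.5-VL
    "video_max_pixels": 16384,            # the per-input pixel budget mentioned above
    "cutoff_len": 2048,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 8,
    "learning_rate": 1.0e-4,
    "num_train_epochs": 3.0,
    "bf16": True,
    "output_dir": "saves/owlet-safety-3b-1",
}

with open("owlet_safety_lora.yaml", "w") as f:
    yaml.safe_dump(config, f)

# Assumes llamafactory-cli is installed and the dataset is registered.
subprocess.run(["llamafactory-cli", "train", "owlet_safety_lora.yaml"], check=True)
```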
The entire fine-tuning run was completed in under an hour on a single NVIDIA A10G GPU, with the total compute cost coming in at under $1—highlighting the efficiency and accessibility of this approach.
Evaluation
We evaluated the fine-tuned model, owlet-safety-3b-1, on a held-out test set to measure its effectiveness in detecting safety events from real-world video clips. Given the multi-label nature of the task—where each video may be associated with multiple safety categories from a predefined set—we compared our model’s performance against two baselines: the base Qwen2.5-VL-3B-Instruct and SmolVLM2-2.2B-Instruct.
To thoroughly assess performance, we used standard multi-label classification metrics such as precision, recall, and F1 score, which together reflect the model’s ability to make accurate and comprehensive predictions. We also report accuracy-based metrics, with a particular emphasis on partial match accuracy—a crucial measure in real-world settings where detecting even part of an event can be valuable for timely intervention.
Precision, Recall and F1
Fine-tuning Qwen2.5-VL-3B on our safety dataset resulted in dramatic improvements across all key classification metrics. Precision doubled from 0.37 to 0.74, sharply reducing false positives. Recall rose from 0.48 to 0.79, a substantial improvement in the model’s ability to detect relevant safety events, which is critical in high-risk, real-world settings. Most notably, the F1 score, which balances precision and recall, climbed from 0.42 to 0.77, an 83% relative improvement over the base model. Compared to both the base Qwen and SmolVLM2 baselines, the fine-tuned model, owlet-safety-3b-1, delivers a significantly more reliable and accurate solution for multi-label safety event detection in video.
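For readers who want to run this kind of evaluation themselves, here is a minimal sketch of multi-label precision, recall, and F1 with scikit-learn. The micro-averaging choice and the toy labels are assumptions for the example; the post does not specify the exact aggregation used.

```python
# Minimal multi-label precision/recall/F1 sketch; micro-averaging is an
# assumption here, not necessarily the aggregation used for the reported scores.
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import precision_recall_fscore_support

LABELS = ["assault", "fall", "fire", "smoke", "sos", "theft", "none"]

# Ground-truth and predicted label sets per test video (illustrative values).
y_true = [{"fire", "smoke"}, {"theft"}, {"none"}, {"fall"}]
y_pred = [{"fire"}, {"theft"}, {"none"}, {"fall", "sos"}]

mlb = MultiLabelBinarizer(classes=LABELS)
Y_true = mlb.fit_transform(y_true)
Y_pred = mlb.transform(y_pred)

precision, recall, f1, _ = precision_recall_fscore_support(
    Y_true, Y_pred, average="micro", zero_division=0
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# -> precision=0.80 recall=0.80 f1=0.80 for the toy data above
```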

Accuracy
In multi-label classification, we use two accuracy measures to evaluate model performance:
- Exact Match Accuracy is a strict metric: a prediction counts as correct only if the model identifies every ground-truth label assigned to a video; missing even one relevant label makes the prediction incorrect.
- Partial Match Accuracy is more lenient: it gives credit if the model correctly predicts at least one of the ground-truth labels for a video (both measures are illustrated in the sketch below).
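Here is a minimal, pure-Python sketch of the two measures, assuming predictions and ground truth are given as lists of label lists per video. Exact match is implemented as set equality (the usual subset-accuracy definition), and the toy data is illustrative only.

```python
# Exact vs. partial match accuracy for multi-label predictions (pure-Python sketch).
def exact_match_accuracy(y_true, y_pred):
    """Fraction of videos whose predicted label set equals the ground-truth set."""
    return sum(set(t) == set(p) for t, p in zip(y_true, y_pred)) / len(y_true)

def partial_match_accuracy(y_true, y_pred):
    """Fraction of videos where at least one ground-truth label was predicted."""
    return sum(bool(set(t) & set(p)) for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative example: 4 test clips.
y_true = [["fire", "smoke"], ["theft"], ["none"], ["fall"]]
y_pred = [["fire"],          ["theft"], ["none"], ["sos"]]

print(exact_match_accuracy(y_true, y_pred))    # 0.5  (clips 2 and 3 match exactly)
print(partial_match_accuracy(y_true, y_pred))  # 0.75 (clip 4 has no overlapping label)
```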

Exact Match Accuracy—which requires the model to predict all correct labels for a video—jumped from 0.22 (base Qwen) to 0.65, a nearly 3× improvement. Likewise, Partial Match Accuracy—where the model must predict at least one correct label—increased from 0.54 to 0.81. These gains mean the fine-tuned model, owlet-safety-3b-1, is significantly better at identifying complex, multi-event safety scenarios and more reliable in real-world conditions where multiple risks may co-occur.

Model Details
The model, owlet-safety-3b-1, is now available. It supports multi-modal, chat-style prompts that accept both video and text inputs, and outputs a comma-separated list of detected safety-related labels.
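For context, inference follows the standard Qwen2.5-VL chat interface. The sketch below assumes the fine-tuned weights are published as a merged Hugging Face checkpoint (the repo id is a placeholder) and uses an instruction of our own wording, not the exact prompt from training.

```python
# Minimal inference sketch; the repo id and prompt wording are assumptions.
# Requires a recent transformers release with Qwen2.5-VL support and qwen-vl-utils.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

MODEL_ID = "your-org/owlet-safety-3b-1"  # placeholder, not an actual hub path

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Chat-style prompt: one video plus a text instruction.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///data/clips/aisle_incident.mp4",
             "max_pixels": 16384, "fps": 1.0},  # mirrors the training pixel budget; can be raised
            {"type": "text", "text": "Which safety events occur in this video? "
                                     "Answer with a comma-separated list of labels "
                                     "(assault, fall, fire, smoke, sos, theft, none)."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=32)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)  # e.g. "fire, smoke"
```

The decoded output is a comma-separated label string, which downstream code can split and match against the fixed label set before triggering alerts.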
Conclusion
owlet-safety-3b-1 demonstrates how compact, fine-tuned vision-language models can deliver strong, production-ready performance in real-time safety event detection from video. By leveraging a carefully curated dataset and efficient LoRA-based training, Owlet significantly outperforms baseline models across precision, recall, F1 score, and accuracy, while remaining lightweight enough for deployment on modest infrastructure.
Designed to detect critical events such as theft, assault, SOS triggers, fire, smoke, and falls, Owlet offers a practical, scalable solution for proactive safety monitoring across industrial, commercial, and public settings. Its strong performance, low compute requirements, and flexible deployment options make it an ideal choice for embedding safety intelligence into video pipelines—whether deployed in the cloud or on-premises.
Follow PhroneticAI on LinkedIn for more posts like this.