Owlet-HAR-1: Building a Better VLM for Human Activity Recognition

Understanding human activities in videos is a challenging task that has significant applications in sports analysis, healthcare monitoring, and autonomous systems. While recent Vision Language Models (VLMs) have shown remarkable progress in video understanding, we set out to build something better—a model specifically optimized for human activity recognition. 

Today, we're releasing Owlet-HAR-1, a fine-tuned Vision Language Model that achieves 68.19% accuracy on the HMDB51 dataset, representing a substantial improvement over base model performance. This research investigates whether overlaying human pose keypoints onto videos enhances activity recognition.

Motivation

Our motivation stems from a fundamental observation: humans naturally recognize activities by focusing on body movements and skeletal structure. If explicit pose information is so crucial to human perception, can we leverage this same structural knowledge to build better VLM models for activity recognition?

This research explores whether incorporating pose information into Vision Language Models can advance the field and build more accurate and practical HAR systems.

Dataset and Methodology

HMDB51: A Comprehensive Action Recognition Dataset

We used the HMDB51 dataset, which contains 6,849 video clips across 51 distinct human action categories. The 51 action categories are organized into five groups:

  • General Facial Actions: smile, laugh, chew, talk
  • Facial Actions with Object Manipulation: smoke, eat, drink
  • General Body Movements: cartwheel, handstand, jump, run, walk, etc.
  • Body Movements with Object Interaction: brush hair, catch, golf, shoot ball, etc.
  • Body Movements for Human Interaction: fencing, hug, kiss, punch, etc.

The dataset's diversity allows us to compare how much pose information contributes to recognizing simple versus complex actions.

Training Setup

We conducted our experiments using LLaMA-Factory, a framework for fine-tuning large language models, running on AWS g5.2xlarge instances with NVIDIA A10G GPUs. Our base model was Qwen2.5-VL-3B, a multimodal vision-language model capable of processing both images and videos.

Key Training Parameters (see the configuration sketch after this list):

  • Learning rate: 5e-05 with cosine annealing
  • Epochs: 3
  • Effective batch size: 16 (2 per device × 8 gradient accumulation steps)
  • Fine-tuning method: LoRA (Low-Rank Adaptation) with rank 8
  • Precision: BF16 for memory efficiency
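
For reference, the sketch below expresses the same hyperparameters in a Hugging Face peft/transformers setup rather than the LLaMA-Factory pipeline we actually used. It is only an approximation: the LoRA target modules, alpha, and dropout values are assumptions, not values taken from our run.

    # Hedged sketch: an equivalent LoRA + training setup using peft/transformers.
    # Only the values listed above come from our run; everything else is an assumption.
    from peft import LoraConfig
    from transformers import TrainingArguments

    lora_config = LoraConfig(
        r=8,                                   # LoRA rank used in our experiments
        lora_alpha=16,                         # assumption
        lora_dropout=0.05,                     # assumption
        target_modules=["q_proj", "v_proj"],   # assumption: attention projections only
        task_type="CAUSAL_LM",
    )

    training_args = TrainingArguments(
        output_dir="owlet-har-1-lora",
        learning_rate=5e-5,
        lr_scheduler_type="cosine",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,         # effective batch size 16
        bf16=True,
        logging_steps=10,
    )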

Each training sample pairs a video with the prompt "What's the activity the person is doing in this video? Answer in one word only." and the activity name as the expected answer:

    {"messages": [ {“content": "<video>What's the activity the person is doing in this video? Answer in one word only.",  "role": "user"  },   { "content": "{activity_name}", ”role": "assistant"  } ],    "videos": [      <video_path>]}

Experimental Design: Three-Way Comparison

To thoroughly evaluate the impact of pose enhancement, we compared three different approaches:

  1. Finetuned Model: Qwen2.5-VL-3B fine-tuned on original HMDB51 videos
  2. Finetuned with Pose Model: The same model fine-tuned on pose-enhanced videos
  3. Base Qwen: The base model without any fine-tuning, serving as a baseline

For pose enhancement, we overlaid skeletal keypoints and joints directly onto the original video frames, creating a version of the dataset with explicit structural information.
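
We don't detail the exact pose pipeline here, so the following is only a minimal sketch of the overlay step, assuming MediaPipe Pose as the keypoint detector and OpenCV for video I/O; the actual estimator, drawing style, and codec used for Owlet-HAR-1 may differ.

    # Hedged sketch: overlay skeletal keypoints on each frame of a video.
    # MediaPipe Pose is an assumption; any single-person 2D pose estimator would do.
    import cv2
    import mediapipe as mp

    mp_pose = mp.solutions.pose
    mp_drawing = mp.solutions.drawing_utils

    def overlay_pose(in_path: str, out_path: str) -> None:
        cap = cv2.VideoCapture(in_path)
        fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
        w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
        with mp_pose.Pose(static_image_mode=False) as pose:
            while True:
                ok, frame = cap.read()
                if not ok:
                    break
                # Detect keypoints on the RGB frame, then draw joints and limbs on top.
                result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
                if result.pose_landmarks:
                    mp_drawing.draw_landmarks(frame, result.pose_landmarks, mp_pose.POSE_CONNECTIONS)
                writer.write(frame)
        cap.release()
        writer.release()

    overlay_pose("hmdb51/run/clip.avi", "hmdb51_pose/run/clip.mp4")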

Results

The experimental results revealed interesting patterns in how different training approaches affect video activity recognition:

Overall Performance Metrics

  • Base Qwen (no fine-tuning): 38.33% accuracy
  • Finetuned Model (original videos): 68.19% accuracy
  • Finetuned with Pose Model (pose-enhanced videos): 57.84% accuracy

Key Findings

1. Fine-tuning Provides Substantial Improvements: The most striking result is the dramatic improvement from fine-tuning. The base Qwen model achieved only 38.33% accuracy, while the fine-tuned model reached 68.19%, an improvement of nearly 30 percentage points.

2. Pose Enhancement Shows Mixed Results: Contrary to our initial hypothesis, adding pose keypoints actually decreased performance compared to the standard fine-tuned model. The pose-enhanced model achieved 57.84% accuracy, about 10 percentage points lower than the model trained on original videos.

3. Action-Specific Performance Patterns: Looking at individual action categories reveals interesting patterns (a sketch for computing such per-class metrics follows the lists below):

Best Performing Actions (Finetuned Model):

  • cartwheeling, drawing_sword, falling, grooming, punching: 100% precision and recall
  • climbing_stairs: 95.24% recall, 90.91% precision
  • drinking: 93.10% recall, 90.00% precision

Challenging Actions Across All Models:

  • biking: The finetuned model completely failed (0% precision/recall), while the pose model achieved 95% for both metrics
  • catching: Struggled across all models, with the pose model performing best
  • shooting: Consistently challenging, with precision around 20-30% across models
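
As a reference for how these per-class numbers can be reproduced, the sketch below computes accuracy and per-class precision/recall with scikit-learn; the prediction-collection step is assumed and simplified (in practice the model's one-word answer must be normalized to match the label set).

    # Hedged sketch: per-class precision/recall from predicted vs. ground-truth labels.
    # Assumes y_true and y_pred are lists of activity names gathered from the evaluation
    # loop; normalizing the model's free-text answers is left out here.
    from sklearn.metrics import accuracy_score, classification_report

    def report(y_true: list[str], y_pred: list[str]) -> None:
        print(f"Accuracy: {accuracy_score(y_true, y_pred):.4f}")
        # Per-class precision, recall, and F1, plus macro/weighted averages.
        print(classification_report(y_true, y_pred, zero_division=0))

    # Example usage with toy labels:
    report(["biking", "drinking", "punching"], ["running", "drinking", "punching"])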

Model Availability

The best performing model (Owlet-HAR-1) is now available on Hugging Face: https://huggingface.co/phronetic-ai/owlet-har-1
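
As a quick start, the snippet below shows how the released checkpoint could be queried, assuming it follows the standard Qwen2.5-VL inference interface in transformers (with qwen_vl_utils for video preprocessing) and that the LoRA weights are merged into the checkpoint; if the repository ships adapters only, they would need to be loaded with peft first.

    # Hedged sketch: video inference with the released model, assuming the standard
    # Qwen2.5-VL interface; loading details may differ for the actual checkpoint.
    from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
    from qwen_vl_utils import process_vision_info

    model_id = "phronetic-ai/owlet-har-1"
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    messages = [{
        "role": "user",
        "content": [
            {"type": "video", "video": "clip.mp4"},  # path to a local video (assumption)
            {"type": "text", "text": "What's the activity the person is doing in this video? Answer in one word only."},
        ],
    }]

    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    _, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], videos=video_inputs, padding=True, return_tensors="pt").to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=16)
    answer = processor.batch_decode(
        output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
    )[0]
    print(answer)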

Analysis: Why Pose Enhancement Didn't Help

The counterintuitive result that pose enhancement decreased performance suggests several possible explanations:

1. Information Redundancy: The base VLM may already be extracting pose-related features implicitly from the visual data. Adding explicit keypoints might not provide new information and could instead create visual clutter.

2. Occlusion Effects: Overlaid pose keypoints might obscure important visual details that the model relies on for classification, particularly for actions involving object manipulation or fine-grained movements.

3. Training Data Quality: The pose detection process might introduce noise or inaccuracies, especially in videos with poor lighting, multiple people, or complex backgrounds.

4. Model Architecture Limitations: The 3B parameter model might not have sufficient capacity to effectively leverage both original visual features and explicit pose information simultaneously.

Future Extensions

Methodological Improvements

  • Test with larger VLM architectures: Investigate whether models with greater parameter counts (7B, 13B+) can better integrate pose information with visual features
  • Temporal modeling enhancements: Develop methods to better capture motion dynamics and temporal relationships in activity sequences

Dataset Expansion

  • Validation on additional action recognition datasets: Test generalization across UCF-101, Kinetics-400, and other standard benchmarks
  • Cross-dataset generalization testing: Evaluate how models trained on one dataset perform on others to assess robustness
  • Multi-person action recognition scenarios: Extend to complex scenes with multiple actors and interactions
  • Domain-specific datasets: Evaluate performance on specialized domains like medical rehabilitation, sports analysis, or industrial safety

Technical Research Directions

  • Pose representation optimization: Investigate alternative pose encodings that minimize visual occlusion while maximizing information content
  • Adaptive pose integration: Develop models that can dynamically decide when pose information is helpful vs. when it should be ignored
  • Real-time optimization: Research efficient architectures suitable for real-time video processing applications
  • Few-shot learning: Explore how quickly these models can adapt to new activity categories with minimal training data

Conclusion

While our study didn't confirm the initial hypothesis that pose enhancement would improve video activity recognition, it provided valuable insights into VLM fine-tuning and the complexities of multimodal learning: VLMs may already be extracting skeletal information implicitly from the videos, and simply adding more modalities can create information redundancy rather than complementarity. This challenges the conventional wisdom that "more data is better" in multimodal AI, highlighting that effective VLM optimization requires careful consideration of how different information types interact rather than naively combining them.

The substantial improvement from fine-tuning (68.19% vs. 38.33% accuracy) demonstrates the importance of domain-specific training, while the mixed results from pose enhancement highlight the need for more sophisticated approaches to integrating structural information.

The field of video understanding continues to evolve rapidly, and these results contribute to our understanding of how to effectively leverage Vision Language Models for human activity recognition. As models become more capable and training techniques more sophisticated, we expect to see continued improvements in this challenging but important domain.