Introduction
The 5th PVUW challenge will be held in conjunction with CVPR 2026 in Denver, CO. Pixel-level scene understanding is one of the fundamental problems in computer vision, aiming to recognize the object class, mask, and semantics of each pixel in a given image. Since the real world is dynamic rather than static, learning to perform video segmentation is more reasonable and practical for realistic applications. To advance the segmentation task from images to videos, we will present new datasets and competitions in this workshop, targeting the challenging yet practical problem of Pixel-level Video Understanding in the Wild (PVUW). The workshop will cover, but is not limited to, the following topics:
- Semantic/panoptic segmentation for images/videos
- Referring image/video comprehension/segmentation
- Video object/instance segmentation
- Video understanding in complex environments
- Language-guided video understanding
- Audio-guided video segmentation
- Efficient computation for video scene parsing
- Semi-supervised recognition in videos
- New metrics to evaluate the quality of video scene parsing results
- Real-world video applications, e.g., autonomous driving, indoor robotics, visual navigation, etc.
Call for Papers
[Update] PVUW 2026 will use [OpenReview] to manage submissions. We look forward to your work and to engaging discussions at the workshop!
We invite authors to submit unpublished papers (8-page CVPR format) to our workshop; accepted papers will be presented at a poster session. All submissions will go through a double-blind review process. All contributions must be submitted (along with supplementary material, if any) through the paper submission portal.
Accepted papers will be published in the official CVPR Workshops proceedings and the Computer Vision Foundation (CVF) Open Access archive.
Challenge Tracks & Submission
Track 1: Complex Video Object Segmentation (MOSEv2) Track
MOSEv2 targets tracking and segmenting objects in videos captured in complex environments. MOSEv2 submission server [click here].
Track 2: Text-based Referring Motion Expression Video Segmentation (MeViSv2 - Text) Track
MeViS-Text focuses on segmenting objects in videos based on a sentence describing the objects' motion. MeViS-Text submission server [click here].
Track 3: Audio-based Referring Motion Expression Video Segmentation (MeViSv2 - Audio) Track
MeViS-Audio focuses on segmenting objects in videos based on an audio clip describing the objects' motion. MeViS-Audio submission server [click here].
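All three tracks follow the standard video object segmentation evaluation setting, in which benchmarks such as MOSE and MeViS report region similarity J (mask IoU) and boundary accuracy F, combined as their mean J&F. Below is a minimal sketch of the J term only, assuming per-frame binary NumPy masks; it is an illustration, not the official scoring code, and the evaluation servers (including the boundary F term) may implement the details differently.

```python
import numpy as np

def region_similarity(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J: intersection-over-union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both masks empty (object absent and correctly not predicted):
        # treat as a perfect match by convention.
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)

def track_j_score(pred_masks, gt_masks) -> float:
    """Mean J over all frames of one object track (lists of HxW binary masks)."""
    return float(np.mean([region_similarity(p, g)
                          for p, g in zip(pred_masks, gt_masks)]))
```

The benchmark score is then typically obtained by averaging per-object scores over all annotated objects and videos.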
Challenge Timeline
| Event | Date |
|---|---|
| Challenge release | Feb 15, 2026 |
| Validation server online | Feb 16, 2026 |
| Test server online | Mar 9, 2026 |
| Submission deadline | Mar 14, 2026 |
| Notification | Mar 16, 2026 |
Paper Submission Timeline
| Event | Date |
|---|---|
| Regular paper submission deadline | Mar 13, 2026 |
| Supplemental material deadline | Mar 13, 2026 |
| Notification of paper acceptance | Mar 19, 2026 |
| Challenge paper submission deadline | Mar 20, 2026 |
| Camera ready deadline | Apr 7, 2026 |
Organizers

Henghui Ding
Fudan University
Nikhila Ravi
Meta AI
Chang Liu
Nanyang Technological University
Yunchao Wei
Beijing Jiaotong University
Jiaxu Miao
Sun Yat-sen University
Shuting He
Shanghai University of Finance and Economics
Leilei Cao
Transsion
Zongxin Yang
Harvard University
Yi Yang
Zhejiang University
Si Liu
Beihang University
Yi Zhu
Amazon
Elisa Ricci
University of Trento
Cees Snoek
University of Amsterdam
Song Bai
ByteDance
Philip Torr
University of Oxford
Contact
Feel free to contact us:
henghui.ding@gmail.com
changliu73@outlook.com