Pixel-level Video Understanding in the Wild

Workshop in conjunction with CVPR 2026

Wed June 3 - Sun June 7, 2026

Denver, CO

Introduction

The 5th PVUW challenge will be held in conjunction with CVPR 2026 in Denver, CO. Pixel-level scene understanding is one of the fundamental problems in computer vision, aiming to recognize the object class, mask, and semantics of each pixel in a given image. Since the real world is dynamic rather than static, learning to perform segmentation on videos is more reasonable and practical for realistic applications. To advance segmentation from images to videos, this workshop will present new datasets and competitions targeting the challenging yet practical task of Pixel-level Video Understanding in the Wild (PVUW). The workshop will cover, but is not limited to, the following topics:

  • Semantic/panoptic segmentation for images/videos
  • Referring image/video comprehension/segmentation
  • Video object/instance segmentation
  • Video understanding in complex environments
  • Language-guided video understanding
  • Audio-guided video segmentation
  • Efficient computation for video scene parsing
  • Semi-supervised recognition in videos
  • New metrics to evaluate the quality of video scene parsing results
  • Real-world video applications, e.g., autonomous driving, indoor robotics, visual navigation, etc.

Call for Papers

[Update] PVUW 2026 will use [OpenReview] to manage submissions. We look forward to your work and to engaging discussions at the workshop!


We invite authors to submit unpublished papers (8-page CVPR format) to our workshop; accepted papers will be presented at a poster session. All submissions will go through a double-blind review process and must be submitted (along with any supplementary materials) through the paper submission portal.


Accepted papers will be published in the official CVPR Workshops proceedings and the Computer Vision Foundation (CVF) Open Access archive.

Challenge Tracks & Submission


Track 1: Complex Video Object Segmentation (MOSEv2) Track

The MOSEv2 track targets tracking and segmenting objects in videos of complex environments. MOSEv2 submission server [click here].

Track 2: Text-based Referring Motion Expression Video Segmentation (MeViSv2 - Text) Track

MeViS-Text focuses on segmenting objects in videos based on a sentence describing their motion. MeViS-Text submission server [click here].

Track 3: Audio-based Referring Motion Expression Video Segmentation (MeViSv2 - Audio) Track

MeViS-Audio focuses on segmenting objects in videos based on an audio clip describing their motion. MeViS-Audio submission server [click here].

Challenge Timeline

Event                         Date
Challenge release             Feb 15, 2026
Validation server online      Feb 16, 2026
Test server online            Mar 9, 2026
Submission deadline           Mar 14, 2026
Notification                  Mar 16, 2026
*All dates are 23:59 UTC on the specified day.

Paper Submission Timeline

Event                                  Date
Regular paper submission deadline      Mar 13, 2026
Supplemental material deadline         Mar 13, 2026
Notification of paper acceptance       Mar 19, 2026
Challenge paper submission deadline    Mar 20, 2026
Camera-ready deadline                  Apr 7, 2026
*All dates are 23:59 UTC on the specified day.

Organizers

Henghui Ding

Fudan University

Nikhila Ravi

Meta AI

Chang Liu

Nanyang Technological University

Yunchao Wei

Beijing Jiaotong University

Jiaxu Miao

Sun Yat-Sen University

Shuting He

Shanghai University of Finance and Economics

Leilei Cao

Transsion

Zongxin Yang

Harvard University

Yi Yang

Zhejiang University

Si Liu

Beihang University

Yi Zhu

Amazon

Elisa Ricci

University of Trento

Cees Snoek

University of Amsterdam

Song Bai

ByteDance

Philip Torr

University of Oxford

Contact