
Patch-level Representation Learning for Self-supervised Vision Transformers

  • Korea Advanced Institute of Science and Technology

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

Recent self-supervised learning (SSL) methods have shown impressive results in learning visual representations from unlabeled images. This paper aims to improve their performance further by utilizing the architectural advantages of the underlying neural network, as the current state-of-the-art visual pretext tasks for SSL do not exploit this benefit, i.e., they are architecture-agnostic. In particular, we focus on Vision Transformers (ViTs), which have gained much attention recently as a better architectural choice, often outperforming convolutional networks on various visual tasks. The unique characteristic of ViT is that it takes a sequence of disjoint patches from an image and processes patch-level representations internally. Inspired by this, we design a simple yet effective visual pretext task, coined SelfPatch, for learning better patch-level representations. To be specific, we enforce invariance between each patch and its neighbors, i.e., each patch treats similar neighboring patches as positive samples. Consequently, training ViTs with SelfPatch learns more semantically meaningful relations among patches (without using human-annotated labels), which can be beneficial, in particular, to downstream tasks of a dense prediction type. Despite its simplicity, we demonstrate that it can significantly improve the performance of existing SSL methods on various visual tasks, including object detection and semantic segmentation. Specifically, SelfPatch significantly improves the recent self-supervised ViT, DINO, by achieving +1.3 AP on COCO object detection, +1.2 AP on COCO instance segmentation, and +2.9 mIoU on ADE20K semantic segmentation.
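The idea that "each patch treats similar neighboring patches as positive samples" can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how similarity is measured, which patches count as neighbors, or how many positives are kept, so cosine similarity, the 8 adjacent grid cells, the parameter `k`, and the function name `select_positive_neighbors` are all assumptions made here for illustration.

```python
import numpy as np

def select_positive_neighbors(patches, k=2):
    """Pick, for each patch on an H x W grid, its k most similar adjacent patches.

    patches: array of shape (H, W, D), one non-zero embedding per patch.
    Returns a dict mapping (row, col) -> list of (row, col) neighbor positions.
    Assumptions (not from the abstract): cosine similarity, 8-connected neighbors.
    """
    h, w, _ = patches.shape
    # L2-normalize so that dot products equal cosine similarities.
    unit = patches / np.linalg.norm(patches, axis=-1, keepdims=True)
    # Offsets to the (up to) 8 adjacent grid cells, excluding the patch itself.
    offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
               if (dr, dc) != (0, 0)]
    positives = {}
    for r in range(h):
        for c in range(w):
            # Keep only neighbors that fall inside the grid.
            neighbors = [(r + dr, c + dc) for dr, dc in offsets
                         if 0 <= r + dr < h and 0 <= c + dc < w]
            sims = [float(unit[r, c] @ unit[nr, nc]) for nr, nc in neighbors]
            # Indices of the k highest similarities, best first.
            order = np.argsort(sims)[::-1][:k]
            positives[(r, c)] = [neighbors[i] for i in order]
    return positives

# Toy 1 x 3 grid: the middle patch resembles the left one far more than the right.
grid = np.array([[[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]])
print(select_positive_neighbors(grid, k=1)[(0, 1)])  # prints [(0, 0)]
```

In an SSL setup, positions selected this way would be the ones whose representations are pulled toward the query patch's representation; how that invariance loss is defined is beyond what the abstract states.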

Original language: English
Title of host publication: Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
Publisher: IEEE Computer Society
Pages: 8344-8353
Number of pages: 10
ISBN (Electronic): 9781665469463
DOIs
State: Published - 2022
Externally published: Yes
Event: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022 - New Orleans, United States
Duration: 19 Jun 2022 to 24 Jun 2022

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Volume: 2022-June
ISSN (Print): 1063-6919

Conference

Conference: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
Country/Territory: United States
City: New Orleans
Period: 19/06/22 to 24/06/22

Keywords

  • Self- & semi- & meta- & unsupervised learning
