TY - JOUR
T1 - Expectation-Maximization via Pretext-Invariant Representations
AU - Oinar, Chingis
AU - Le, Binh M.
AU - Woo, Simon S.
N1 - Publisher Copyright:
© 2013 IEEE.
PY - 2023
Y1 - 2023
N2 - Contrastive learning has been widely adopted in numerous unsupervised and self-supervised visual representation learning methods. Such algorithms aim to maximize the cosine similarity between two positive samples while minimizing that of negative samples. Recently, Grill et al. proposed BYOL, an algorithm that utilizes only positive samples, completely giving up on negative ones, by introducing a Siamese-like asymmetric architecture. Although many recent state-of-the-art (SOTA) methods adopt this architecture, most of them simply introduce an additional neural network, the predictor, without much exploration of the asymmetric architecture itself. In contrast, He et al. proposed SimSiam, a simple Siamese architecture that relies on a stop-gradient operation instead of a momentum encoder, and described the framework from the perspective of Expectation-Maximization. We argue that BYOL-like algorithms attain suboptimal performance due to representation inconsistency during training. In this work, we propose a novel self-supervised objective, Expectation-Maximization via Pretext-Invariant Representations (EMPIR), which enhances Expectation-Maximization-based optimization in BYOL-like algorithms by enforcing augmentation invariance within a local region of k nearest neighbors, resulting in consistent representation learning. In other words, we propose Expectation-Maximization as a core task of asymmetric architectures. We show that EMPIR consistently outperforms other SOTA algorithms by a clear margin. We also demonstrate its transfer learning capabilities on downstream image recognition tasks.
AB - Contrastive learning has been widely adopted in numerous unsupervised and self-supervised visual representation learning methods. Such algorithms aim to maximize the cosine similarity between two positive samples while minimizing that of negative samples. Recently, Grill et al. proposed BYOL, an algorithm that utilizes only positive samples, completely giving up on negative ones, by introducing a Siamese-like asymmetric architecture. Although many recent state-of-the-art (SOTA) methods adopt this architecture, most of them simply introduce an additional neural network, the predictor, without much exploration of the asymmetric architecture itself. In contrast, He et al. proposed SimSiam, a simple Siamese architecture that relies on a stop-gradient operation instead of a momentum encoder, and described the framework from the perspective of Expectation-Maximization. We argue that BYOL-like algorithms attain suboptimal performance due to representation inconsistency during training. In this work, we propose a novel self-supervised objective, Expectation-Maximization via Pretext-Invariant Representations (EMPIR), which enhances Expectation-Maximization-based optimization in BYOL-like algorithms by enforcing augmentation invariance within a local region of k nearest neighbors, resulting in consistent representation learning. In other words, we propose Expectation-Maximization as a core task of asymmetric architectures. We show that EMPIR consistently outperforms other SOTA algorithms by a clear margin. We also demonstrate its transfer learning capabilities on downstream image recognition tasks.
KW - Expectation-maximization
KW - K-nearest neighbors
KW - Pretext-invariant representation
KW - Self-supervised learning
UR - https://www.scopus.com/pages/publications/85163434294
U2 - 10.1109/ACCESS.2023.3289589
DO - 10.1109/ACCESS.2023.3289589
M3 - Article
AN - SCOPUS:85163434294
SN - 2169-3536
VL - 11
SP - 65266
EP - 65276
JO - IEEE Access
JF - IEEE Access
ER -