TY - JOUR
T1 - Classification of Sewer Defects Using Point Clouds Based on a Novel Sewer Vision Transformer With Cross-Modal In-Domain Knowledge
AU - Jing, Shuju
AU - Li, Xiangyang
AU - Beyene, Daniel Asefa
AU - Cha, Gichun
AU - Park, Seunghee
N1 - Publisher Copyright:
© 2001-2012 IEEE.
PY - 2025
Y1 - 2025
AB - The high-precision geometric measurement capabilities of sensor-based point clouds provide significant advantages for sewer defect detection. To enhance the classification of valuable yet data-scarce sewer-defect knowledge within the point cloud community, this study proposes a cross-modal framework that combines self-supervised pretraining with supervised fine-tuning. The proposed sewer vision transformer (Sewer-ViT) integrates key-edge sampling, neighborhood dilation learning, dual-domain feature fusion, and inverted bottleneck structures to reinforce defect feature embedding and inductive bias. These features are subsequently processed by a transformer encoder pretrained with 2-D in-domain knowledge, and the latent representations are further optimized through weight fusion within a unified vector space, thereby improving classification performance. The method achieved average precision, recall, and F1-scores of 75.87%, 76.73%, and 75.44% on the overall test set and 65.09%, 62.47%, and 62.58% on a real-world test set, respectively, surpassing the existing approaches. These results highlight the practical potential of this method for sewer defect detection and point to a promising future for multimodal fusion research.
KW - Cross-modal learning
KW - point clouds
KW - self-supervised learning (SSL)
KW - sewer-defect classification
KW - vision transformer (ViT)
UR - https://www.scopus.com/pages/publications/105016720625
U2 - 10.1109/JSEN.2025.3609788
DO - 10.1109/JSEN.2025.3609788
M3 - Article
AN - SCOPUS:105016720625
SN - 1530-437X
VL - 25
SP - 40188
EP - 40202
JO - IEEE Sensors Journal
JF - IEEE Sensors Journal
IS - 21
ER -