Abstract
The high-precision geometric measurements provided by sensor-based point clouds offer significant advantages for sewer defect detection. Because labeled sewer-defect point cloud data are valuable but scarce, this study proposes a cross-modal framework that combines self-supervised pretraining with supervised fine-tuning to improve defect classification. The proposed sewer vision transformer (Sewer-ViT) integrates key-edge sampling, neighborhood dilation learning, dual-domain feature fusion, and inverted bottleneck structures to reinforce defect feature embedding and inductive bias. The resulting features are processed by a transformer encoder pretrained with 2-D in-domain knowledge, and the latent representations are further optimized through weight fusion within a unified vector space, thereby improving classification performance. The method achieved average precision, recall, and F1-scores of 75.87%, 76.73%, and 75.44% on the overall test set and 65.09%, 62.47%, and 62.58% on a real-world test set, respectively, surpassing existing approaches. These results highlight the practical potential of the method for sewer defect detection and point to a promising future for multimodal fusion research.
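The abstract reports average precision, recall, and F1-scores without stating the averaging scheme. The sketch below assumes macro-averaging over defect classes, which is an assumption rather than something the abstract confirms. The class labels and the helper names `macro_scores` and `harmonic_mean` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return macro-averaged (precision, recall, F1) over all observed classes.

    Assumption: each class contributes equally to the average, regardless of
    how many samples it has. The abstract does not state which scheme is used.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


def harmonic_mean(p, r):
    """Harmonic mean of two scores (the F1 formula applied to one P/R pair)."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Hypothetical toy labels, not data from the paper.
y_true = ["crack", "crack", "joint", "joint", "normal", "normal"]
y_pred = ["crack", "joint", "joint", "joint", "normal", "crack"]
precision, recall, f1 = macro_scores(y_true, y_pred)
print(f"macro P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

Under this assumption, the averaged F1 is the mean of per-class F1 values, so it need not equal the harmonic mean of the averaged precision and recall. That is consistent with the reported overall figures: `harmonic_mean(75.87, 76.73)` is about 76.30, whereas the reported F1-score is 75.44%.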
| Original language | English |
|---|---|
| Pages (from-to) | 40188-40202 |
| Number of pages | 15 |
| Journal | IEEE Sensors Journal |
| Volume | 25 |
| Issue number | 21 |
| State | Published - 2025 |
Keywords
- Cross-modal learning
- point clouds
- self-supervised learning (SSL)
- sewer-defect classification
- vision transformer (ViT)
Title: Classification of Sewer Defects Using Point Clouds Based on a Novel Sewer Vision Transformer With Cross-Modal In-Domain Knowledge