Abstract
Despite advances in computer vision techniques such as object detection and segmentation, a significant gap remains in leveraging these technologies for hazard recognition through natural language processing. To address this gap, this paper proposes VQA-RESCon, an approach that combines Visual Question Answering (VQA) and Referring Expression Segmentation (RES) to enhance construction safety analysis. By leveraging the visual grounding capabilities of RES, the method not only identifies potential hazards through VQA but also precisely localizes and highlights them within the image. A large “scenario-questions” dataset of 200,000 images and 16 targeted questions is used to train a vision-and-language transformer model, and post-processing is applied using CLIPSeg and the Segment Anything Model. Validation results indicate that both the VQA and RES models are reliable and precise: the VQA model achieves an F1 score above 90%, while the segmentation models reach a mean Intersection over Union of 57%.
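The mean Intersection over Union (mIoU) metric reported above can be illustrated with a minimal sketch. This is not code from the paper; the function names and the toy masks are illustrative only, and the standard IoU definition (intersection area over union area of binary masks) is assumed:

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over Union between a predicted and a ground-truth binary mask."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(np.logical_and(pred, gt).sum() / union)

def mean_iou(preds, gts) -> float:
    """Mean IoU over a set of (predicted, ground-truth) mask pairs."""
    return float(np.mean([iou(p, g) for p, g in zip(preds, gts)]))

# Toy example: 4x4 predicted mask vs. ground truth.
pred = np.array([[1, 1, 0, 0],
                 [1, 1, 0, 0],
                 [0, 0, 0, 0],
                 [0, 0, 0, 0]])
gt = np.array([[0, 1, 1, 0],
               [0, 1, 1, 0],
               [0, 0, 0, 0],
               [0, 0, 0, 0]])
print(round(iou(pred, gt), 3))  # intersection 2, union 6 -> 0.333
```

An mIoU of 57% therefore means that, averaged over the validation masks, slightly more than half of the combined predicted-plus-ground-truth region is correctly overlapped.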
| Original language | English |
|---|---|
| Article number | 106127 |
| Journal | Automation in Construction |
| Volume | 174 |
| DOIs | |
| State | Published - Jun 2025 |
Keywords
- Construction safety analysis
- Referring Expression Segmentation
- Visual Question Answering
Fingerprint
Dive into the research topics of 'Visual Question Answering-based Referring Expression Segmentation for construction safety analysis'. Together they form a unique fingerprint.