Multi-scale hybrid vision transformer and Sinkhorn tokenizer for sewer defect classification
Highlights
- The Multi-Scale Hybrid Vision Transformer is proposed for sewer defect classification.
- The Sinkhorn tokenizer is proposed for non-local feature aggregation.
- MSHViT outperforms baseline methods on the Sewer-ML sewer defect dataset.
- The MSHViT architecture is analyzed in terms of accuracy and efficiency.
- Visual verification of the non-local interactions, useful for informing sewer inspectors.
Abstract
Capturing the non-local spatial semantics of image content is a crucial part of image classification. This paper describes the multi-scale hybrid vision transformer (MSHViT), an extension of the classical convolutional neural network (CNN) backbone, for multi-label sewer defect classification. To better model spatial semantics in the images, features are aggregated non-locally at different scales using a lightweight vision transformer, and a smaller set of tokens is produced by a novel Sinkhorn clustering-based tokenizer that uses distinct cluster centers. The proposed MSHViT and Sinkhorn tokenizer are evaluated on the Sewer-ML multi-label sewer defect classification dataset, showing consistent performance improvements of up to 2.53 percentage points.
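The abstract does not spell out how the Sinkhorn tokenizer works, so the following is only a minimal sketch of the general idea of Sinkhorn-based clustering of feature vectors into a smaller token set, not the authors' implementation. The function names `sinkhorn_assign` and `tokenize`, the cosine-similarity kernel, and the parameters `eps` and `n_iters` are all assumptions made for illustration.

```python
import numpy as np


def sinkhorn_assign(features, centers, eps=0.1, n_iters=50):
    """Soft-assign N feature vectors to K cluster centers.

    Hypothetical helper: alternating column/row normalization
    (Sinkhorn-Knopp) pushes the assignment towards using all
    clusters about equally often.
    """
    # Cosine similarity between every feature and every center: (N, K).
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    c = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    sim = f @ c.T
    # Entropic kernel; subtracting the max keeps exp() from overflowing.
    q = np.exp((sim - sim.max()) / eps)
    n, k = q.shape
    for _ in range(n_iters):
        q /= q.sum(axis=0, keepdims=True) * k  # each column sums to 1/K
        q /= q.sum(axis=1, keepdims=True) * n  # each row sums to 1/N
    return q * n  # rescale so that each row sums to 1


def tokenize(features, centers, eps=0.1, n_iters=50):
    """Aggregate N features into K tokens (assumed aggregation rule:
    assignment-weighted average of the features in each cluster)."""
    a = sinkhorn_assign(features, centers, eps=eps, n_iters=n_iters)
    weights = a / a.sum(axis=0, keepdims=True)  # normalize per cluster
    return weights.T @ features  # (K, D)


# Example: 196 spatial positions with 64 channels reduced to 8 tokens.
rng = np.random.default_rng(0)
feats = rng.normal(size=(196, 64))
cents = rng.normal(size=(8, 64))
tokens = tokenize(feats, cents)
```

In a hybrid CNN and transformer model, tokens like these could be passed to a transformer layer in place of the full feature map, which keeps the attention cost low because it depends on the number of tokens rather than on the number of spatial positions.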
Haurum, Joakim Bruslund (author) / Madadi, Meysam (author) / Escalera, Sergio (author) / Moeslund, Thomas B. (author)
3 October 2022
Article (Journal)
Electronic Resource
English
Towards Trustworthy Multi-label Sewer Defect Classification via Evidential Deep Learning
ArXiv | 2022
Multiple Defect Classification Method for Green Plum Surfaces Based on Vision Transformer
DOAJ | 2023