Does Your ViT Still Need U-Net for Segmentation?

arXiv:2607.00223v1 Announce Type: new Abstract: Medical image segmentation is dominated by U-Net-style encoder-decoder architectures. Vision Transformers (ViTs) overcome the limited receptive field of convolutional networks through self-attention, enabling modeling of long-range dependencies. Early ViT-based segmentation methods typically retained U-Net-style decoders because pretrained ViT representations were insufficient to support accurate dense prediction. Recent advances in large-scale pre...

arXiv cs.CV · Xin Li, Wenhui Zhu, Xuanzhao Dong, Xiwen Chen, Yanxi Chen, Yujian Xiong, Hao Wang, Oana M. Dumitrascu, Yalin Wang