Does Your ViT Still Need U-Net for Segmentation?
arXiv:2607.00223v1. Abstract: Medical image segmentation is dominated by U-Net-style encoder-decoder architectures. Vision Transformers (ViTs) overcome the limited receptive field of convolutional networks through self-attention, enabling modeling of long-range dependencies. Early ViT-based segmentation methods typically retained U-Net-style decoders because pretrained ViT representations were insufficient to support accurate dense prediction. Recent advances in large-scale pre...
arXiv cs.CV
Xin Li, Wenhui Zhu, Xuanzhao Dong, Xiwen Chen, Yanxi Chen, Yujian Xiong, Hao Wang, Oana M. Dumitrascu, Yalin Wang