Recent work has demonstrated that large transformer models can achieve or advance the SOTA in computer vision tasks such as semantic segmentation and object detection. However, unlike convolutional network models that can do it only with the standard public dataset, it takes a proprietary dataset that is magnitudes larger. The recent project VOLO (Vision Outlooker) from SEA AI Lab��
]]>