Segmentation using CLIP
Zero-shot segmentation using large pre-trained language-image models like CLIP.

In this project, we explore language-driven zero-shot semantic segmentation using large pre-trained language-vision classification models like CLIP. We modify the vision branch of CLIP, replacing its classification backbones (ResNets and Vision Transformers) with segmentation-oriented architectures such as PSPNet, DeepLab, and DPT. So far, we have paired PSPNet with CLIP's frozen text transformer. This gives a good starting point: the predicted classes land in approximately the right locations, but the segmentation maps are blobby and the boundaries are poorly defined. This is because, through training, CLIP's image encoder is tied semantically to the text encoder's embeddings, while a newly attached segmentation backbone is not. We can address this by adding PSPNet alongside the CLIP image encoder (rather than replacing it) and using CLIP's maps as pseudo-labels for training our segmentation model. We are currently testing methods to improve this. ...
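Below is a minimal sketch (not the project's actual code) of the core idea: a dense segmentation backbone produces per-pixel features that are projected into CLIP's embedding space and scored against the frozen text embeddings of the class prompts by cosine similarity. The `LanguageDrivenSegmenter` class, the `backbone`/`backbone_channels` arguments, and the prompt template are illustrative assumptions; the only external API used is OpenAI's `clip` package (`clip.load`, `clip.tokenize`, `encode_text`).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip  # OpenAI CLIP: https://github.com/openai/CLIP


class LanguageDrivenSegmenter(nn.Module):
    """Sketch: dense per-pixel features scored against frozen CLIP text embeddings."""

    def __init__(self, backbone: nn.Module, backbone_channels: int,
                 class_names, clip_name="ViT-B/32", device="cpu"):
        super().__init__()
        # Segmentation trunk (e.g. PSPNet / DeepLab / DPT) that maps an image
        # [B, 3, H, W] to a dense feature map [B, backbone_channels, h, w].
        self.backbone = backbone

        # Encode one prompt per class with CLIP's text transformer, then freeze.
        clip_model, _ = clip.load(clip_name, device=device)
        with torch.no_grad():
            tokens = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
            text_emb = F.normalize(clip_model.encode_text(tokens).float(), dim=-1)
        self.register_buffer("text_emb", text_emb)           # [K, D], not trained

        # 1x1 conv projects dense image features into CLIP's embedding space.
        self.proj = nn.Conv2d(backbone_channels, text_emb.shape[1], kernel_size=1)
        self.logit_scale = nn.Parameter(torch.tensor(1.0 / 0.07).log())

    def forward(self, images):
        feats = self.backbone(images)                        # [B, C, h, w]
        feats = F.normalize(self.proj(feats), dim=1)         # [B, D, h, w]
        # Cosine similarity of every pixel embedding with every class embedding.
        logits = self.logit_scale.exp() * torch.einsum("bdhw,kd->bkhw",
                                                       feats, self.text_emb)
        # Upsample to the input resolution for per-pixel class scores.
        return F.interpolate(logits, size=images.shape[-2:],
                             mode="bilinear", align_corners=False)
```

Training this head against CLIP-derived pseudo-label maps (the second idea above) would only change the supervision target of a per-pixel cross-entropy loss; the forward pass stays the same.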