concepts.vision.fm_match.dino.extractor_dino.ViTExtractor
- class ViTExtractor
Bases: object
This class facilitates the extraction of features, descriptors, and saliency maps from a ViT. The following notation is used in the documentation of the module's methods:
- B - batch size
- h - number of heads; usually takes the place of the channel dimension in PyTorch's BxCxHxW convention
- p - patch size of the ViT, either 8 or 16
- t - number of tokens; equals the number of patches + 1, i.e. HW / p**2 + 1, where H and W are the height and width of the input image
- d - the embedding dimension of the ViT
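A minimal usage sketch (the image path, load size, and CUDA device below are illustrative assumptions; the import path follows this module's location):

```python
import torch
from concepts.vision.fm_match.dino.extractor_dino import ViTExtractor

# Build an extractor around a DINO ViT-S/8 backbone (default arguments).
extractor = ViTExtractor(model_type='dino_vits8', stride=4, device='cuda')

# Preprocess an image into a BxCxHxW tensor plus the resized PIL image.
batch, pil_image = extractor.preprocess('example.jpg', load_size=224)  # placeholder path

with torch.no_grad():
    # Per-patch descriptors of shape Bx1xtxd' and saliency maps of shape Bx(t-1).
    descriptors = extractor.extract_descriptors(batch.to('cuda'), layer=11, facet='key')
    saliency = extractor.extract_saliency_maps(batch.to('cuda'))
```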
Methods
- create_model(model_type)
- extract_descriptors(batch[, layer, facet, ...]) - Extract descriptors from the model.
- extract_saliency_maps(batch) - Extract saliency maps.
- patch_vit_resolution(model, stride) - Change the resolution of the model output by changing the stride of the patch extraction.
- preprocess(image_path[, load_size, patch_size]) - Preprocess an image file before extraction.
- preprocess_pil(pil_image) - Preprocess a PIL image before extraction.
- __init__(model_type='dino_vits8', stride=4, model=None, device='cuda')
- Parameters:
model_type (str) – A string specifying the type of model to extract from. One of [dino_vits8 | dino_vits16 | dino_vitb8 | dino_vitb16 | vit_small_patch8_224 | vit_small_patch16_224 | vit_base_patch8_224 | vit_base_patch16_224].
stride (int) – Stride of the first convolution (patch-embedding) layer. A smaller stride yields a higher-resolution feature map.
model (Module) – Optional. An nn.Module to extract from instead of creating a new one inside ViTExtractor; should be compatible with model_type.
device (str) – Torch device to run the model on, e.g. 'cuda' (the default) or 'cpu'.
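Hedged construction examples (the device choices are assumptions; the model names come from the list above):

```python
from concepts.vision.fm_match.dino.extractor_dino import ViTExtractor

# DINO ViT-S/8 with stride 4: patches of size 8 sampled every 4 pixels,
# giving an overlapping, denser descriptor grid.
extractor_dense = ViTExtractor(model_type='dino_vits8', stride=4, device='cuda')

# A timm ViT-B/16 backbone kept at its native stride (no overlap between patches).
extractor_b16 = ViTExtractor(model_type='vit_base_patch16_224', stride=16, device='cpu')
```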
- __new__(**kwargs)
- extract_descriptors(batch, layer=11, facet='key', bin=False, include_cls=False)
Extract descriptors from the model.
- Parameters:
batch – Batch to extract descriptors for. Has shape BxCxHxW.
layer – Layer to extract from. A number between 0 and 11.
facet – Facet to extract. One of the following options: ['key' | 'query' | 'value' | 'token'].
bin – Apply log binning to the descriptor. Default is False.
include_cls – Whether to include the CLS token. Default is False.
- Returns:
Tensor of descriptors of shape Bx1xtxd', where d' is the dimension of the descriptors.
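A sketch of typical calls, continuing from the construction sketch above (the layer and facet choices here are illustrative):

```python
import torch

with torch.no_grad():
    # 'key' facet of the last transformer block, without log binning.
    key_desc = extractor.extract_descriptors(batch.to('cuda'), layer=11, facet='key', bin=False)
    # Raw token embeddings from an earlier layer, for comparison.
    tok_desc = extractor.extract_descriptors(batch.to('cuda'), layer=9, facet='token')

print(key_desc.shape)  # Bx1xtxd', as described in the Returns note above
```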
- extract_saliency_maps(batch)
Extract saliency maps. The saliency maps are obtained by averaging several attention heads of the CLS token from the last layer. All values are then normalized to the range [0, 1].
- Parameters:
batch – Batch to extract saliency maps for. Has shape BxCxHxW.
- Returns:
A tensor of saliency maps of shape Bx(t-1).
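A sketch of turning the flat Bx(t-1) saliency vector into a 2D map, continuing from the sketch above (the square-grid assumption only holds for square inputs):

```python
import torch

with torch.no_grad():
    saliency = extractor.extract_saliency_maps(batch.to('cuda'))  # Bx(t-1), values in [0, 1]

# Assumes a square patch grid; for non-square inputs use the actual grid dimensions.
grid = int(saliency.shape[-1] ** 0.5)
saliency_map = saliency.reshape(-1, grid, grid)
```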
- static patch_vit_resolution(model, stride)
Change the resolution of the model output by changing the stride of the patch extraction.
- Parameters:
model – The model to change the resolution of.
stride – The new stride parameter.
- Returns:
The adjusted model.
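A sketch of how the stride affects the token grid: for an input of height H, width W, patch size p, and stride s, each spatial dimension yields roughly 1 + (H - p) / s patches. The torch.hub call below is one way to obtain a compatible DINO backbone and is an assumption, not part of this API:

```python
import torch
from concepts.vision.fm_match.dino.extractor_dino import ViTExtractor

# Load a DINO ViT-S/8 backbone (patch size 8) and halve its patch-extraction stride.
dino = torch.hub.load('facebookresearch/dino:main', 'dino_vits8')
dense_dino = ViTExtractor.patch_vit_resolution(dino, stride=4)

# For a 224x224 input with p = 8: stride 8 gives a 28x28 patch grid,
# stride 4 gives a 55x55 grid (1 + (224 - 8) / s patches per side).
```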
- preprocess(image_path, load_size=None, patch_size=14)
Preprocess an image before extraction.
- Parameters:
image_path – Path to the image to be extracted.
load_size – Optional. Size to resize the image to before the rest of the preprocessing.
patch_size – Optional. Default is 14.
- Returns:
A tuple containing: (1) the preprocessed image as a tensor of shape BxCxHxW to feed into the model; (2) the PIL image in the corresponding dimensions.
- preprocess_pil(pil_image)
Preprocess a PIL image before extraction.
- Parameters:
pil_image – The PIL image to be extracted.
- Returns:
A tuple containing: (1) the preprocessed image as a tensor of shape BxCxHxW to feed into the model; (2) the PIL image in the corresponding dimensions.
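A sketch of preprocessing an in-memory PIL image, continuing from the construction sketch above (the solid-colour test image is an assumption for illustration):

```python
from PIL import Image

pil = Image.new('RGB', (224, 224))
batch, resized_pil = extractor.preprocess_pil(pil)
descriptors = extractor.extract_descriptors(batch.to('cuda'), layer=11, facet='key')
```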