concepts.vision.fm_match.dino.extractor_dino.ViTExtractor#

class ViTExtractor[source]#

Bases: object

This class facilitates extraction of features, descriptors, and saliency maps from a ViT. The following notation is used in the documentation of the module's methods:

  • B – batch size.

  • h – number of heads; usually takes the place of the channel dimension in PyTorch's BxCxHxW convention.

  • p – patch size of the ViT, either 8 or 16.

  • t – number of tokens; equals the number of patches + 1, i.e. HW / p**2 + 1, where H and W are the height and width of the input image.

  • d – the embedding dimension of the ViT.
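A minimal end-to-end sketch based on the signatures documented below; the image path is a placeholder, and the device is set to "cpu" here for portability (the default is "cuda"):

```python
import torch
from concepts.vision.fm_match.dino.extractor_dino import ViTExtractor

# Build an extractor; the default device is 'cuda', 'cpu' is used here for portability.
extractor = ViTExtractor(model_type="dino_vits8", stride=4, device="cpu")

# "image.jpg" is a placeholder path; preprocess returns the BxCxHxW tensor and the resized PIL image.
image_tensor, pil_image = extractor.preprocess("image.jpg", load_size=224)

with torch.no_grad():
    # Bx1xtxd' descriptors from the 'key' facet of layer 11.
    descriptors = extractor.extract_descriptors(image_tensor, layer=11, facet="key")
    # Bx(t-1) saliency values, one per patch token, normalized to [0, 1].
    saliency = extractor.extract_saliency_maps(image_tensor)
```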

Methods

create_model(model_type)

extract_descriptors(batch[, layer, facet, ...])

Extract descriptors from the model.

extract_saliency_maps(batch)

Extract saliency maps.

patch_vit_resolution(model, stride)

Change the resolution of the model output by changing the stride of the patch extraction.

preprocess(image_path[, load_size, patch_size])

Preprocesses an image before extraction.

preprocess_pil(pil_image)

Preprocesses a PIL image before extraction.

__init__(model_type='dino_vits8', stride=4, model=None, device='cuda')[source]#
Parameters:
  • model_type (str) – A string specifying the type of model to extract from. [dino_vits8 | dino_vits16 | dino_vitb8 | dino_vitb16 | vit_small_patch8_224 | vit_small_patch16_224 | vit_base_patch8_224 | vit_base_patch16_224]

  • stride (int) – stride of the first convolution layer. A smaller stride yields a higher output resolution.

  • model (Module) – Optional parameter. The nn.Module to extract from instead of creating a new one in ViTExtractor. Should be compatible with model_type.

  • device (str) – device to run the model on.
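A construction sketch under the parameters above; the devices and strides are illustrative choices, not recommendations:

```python
from concepts.vision.fm_match.dino.extractor_dino import ViTExtractor

# Standard configuration: build the backbone from model_type.
extractor = ViTExtractor(model_type="dino_vits16", stride=8, device="cpu")

# A smaller stride gives a denser (higher-resolution) token grid at a higher compute cost.
dense_extractor = ViTExtractor(model_type="dino_vits8", stride=4, device="cpu")
```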

__new__(**kwargs)#
static create_model(model_type)[source]#
Parameters:

model_type (str) – A string specifying which model to load. [dino_vits8 | dino_vits16 | dino_vitb8 | dino_vitb16 | vit_small_patch8_224 | vit_small_patch16_224 | vit_base_patch8_224 | vit_base_patch16_224]

Returns:

The loaded model.

Return type:

Module
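A sketch of building the backbone separately and handing it to ViTExtractor via its model argument; the model_type string is taken from the list above:

```python
from concepts.vision.fm_match.dino.extractor_dino import ViTExtractor

# create_model is static; the returned nn.Module should match the model_type passed to the extractor.
backbone = ViTExtractor.create_model("dino_vitb8")
extractor = ViTExtractor(model_type="dino_vitb8", stride=4, model=backbone, device="cpu")
```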

extract_descriptors(batch, layer=11, facet='key', bin=False, include_cls=False)[source]#

Extract descriptors from the model. :param batch: batch to extract descriptors for. Has shape BxCxHxW. :param layer: layer to extract from. A number between 0 and 11. :param facet: facet to extract. One of the following options: [‘key’ | ‘query’ | ‘value’ | ‘token’] :param bin: apply log binning to the descriptor. Default is False. :param include_cls: whether to include the CLS token in the returned descriptors. Default is False. :return: tensor of descriptors. Bx1xtxd’ where d’ is the dimension of the descriptors.

Parameters:
  • batch (Tensor)

  • layer (int)

  • facet (str)

  • bin (bool)

  • include_cls (bool)

Return type:

Tensor
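A sketch using a random batch in place of real images (shapes follow the notation above; the layer and facet are illustrative choices):

```python
import torch
from concepts.vision.fm_match.dino.extractor_dino import ViTExtractor

extractor = ViTExtractor(model_type="dino_vits8", stride=4, device="cpu")
batch = torch.rand(1, 3, 224, 224)  # BxCxHxW placeholder input

with torch.no_grad():
    descriptors = extractor.extract_descriptors(batch, layer=9, facet="token")
# descriptors has shape Bx1xtxd'; t depends on the image size, patch size, and stride.
```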

extract_saliency_maps(batch)[source]#

Extract saliency maps. The saliency maps are extracted by averaging several attention heads of the CLS token from the last layer. All values are then normalized to the range between 0 and 1. :param batch: batch to extract saliency maps for. Has shape BxCxHxW. :return: a tensor of saliency maps. Has shape Bx(t-1).

Parameters:

batch (Tensor)

Return type:

Tensor
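A sketch with a random batch; real images would be prepared with preprocess or preprocess_pil first:

```python
import torch
from concepts.vision.fm_match.dino.extractor_dino import ViTExtractor

extractor = ViTExtractor(model_type="dino_vits8", stride=4, device="cpu")
batch = torch.rand(1, 3, 224, 224)  # BxCxHxW placeholder input

with torch.no_grad():
    saliency = extractor.extract_saliency_maps(batch)  # Bx(t-1), values in [0, 1]
```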

static patch_vit_resolution(model, stride)[source]#

Change the resolution of the model output by changing the stride of the patch extraction. :param model: the model to change resolution for. :param stride: the new stride parameter. :return: the adjusted model

Parameters:
  • model (Module)

  • stride (int)

Return type:

Module
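A sketch of re-striding an existing backbone directly; both calls are static methods of ViTExtractor, and the stride value is an illustrative choice:

```python
from concepts.vision.fm_match.dino.extractor_dino import ViTExtractor

backbone = ViTExtractor.create_model("dino_vits8")
# Lower the patch-extraction stride to obtain a denser token grid from the same weights.
dense_backbone = ViTExtractor.patch_vit_resolution(backbone, stride=4)
```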

preprocess(image_path, load_size=None, patch_size=14)[source]#

Preprocesses an image before extraction. :param image_path: path to the image to be extracted. :param load_size: optional. Size to resize the image to before the rest of preprocessing. :return: a tuple containing:

  1. the preprocessed image as a tensor of shape BxCxHxW, ready to be fed into the model.

  2. the PIL image in the relevant dimensions

Parameters:
Return type:

Tuple[Tensor, Image]
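A sketch of the preprocessing call; "image.jpg" and the load size are placeholders:

```python
from concepts.vision.fm_match.dino.extractor_dino import ViTExtractor

extractor = ViTExtractor(model_type="dino_vits8", stride=4, device="cpu")
image_tensor, resized_pil = extractor.preprocess("image.jpg", load_size=224)
# image_tensor: BxCxHxW tensor ready for the model; resized_pil: the resized PIL image.
```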

preprocess_pil(pil_image)[source]#

Preprocesses a PIL image before extraction. :param pil_image: the PIL image to be preprocessed. :return: a tuple containing:

  1. the preprocessed image as a tensor of shape BxCxHxW, ready to be fed into the model.

  2. the PIL image in the relevant dimensions
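A sketch of preprocessing an already-loaded PIL image; the image path is a placeholder:

```python
from PIL import Image
from concepts.vision.fm_match.dino.extractor_dino import ViTExtractor

extractor = ViTExtractor(model_type="dino_vits8", stride=4, device="cpu")
pil_image = Image.open("image.jpg").convert("RGB")
prep = extractor.preprocess_pil(pil_image)  # preprocessed input, per the docstring above
```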