concepts.vision.fm_match.dino.extractor_dino.ViTExtractor
- class ViTExtractor
Bases: object
This class facilitates the extraction of features, descriptors, and saliency maps from a ViT. The following notation is used in the documentation of the module's methods:
- B - batch size
- h - number of heads; usually takes the place of the channel dimension in PyTorch's BxCxHxW convention
- p - patch size of the ViT, either 8 or 16
- t - number of tokens; equals the number of patches + 1, i.e. HW / p**2 + 1, where H and W are the height and width of the input image
- d - the embedding dimension of the ViT
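A minimal usage sketch (the image path, load size, and CUDA device below are illustrative assumptions; the import path follows this module's location):

```python
import torch
from concepts.vision.fm_match.dino.extractor_dino import ViTExtractor

# Build an extractor around a DINO ViT-S/8 backbone (default arguments).
extractor = ViTExtractor(model_type='dino_vits8', stride=4, device='cuda')

# Preprocess an image into a BxCxHxW tensor plus the resized PIL image.
batch, pil_image = extractor.preprocess('example.jpg', load_size=224)  # placeholder path

with torch.no_grad():
    # Per-patch descriptors of shape Bx1xtxd' and saliency maps of shape Bx(t-1).
    descriptors = extractor.extract_descriptors(batch.to('cuda'), layer=11, facet='key')
    saliency = extractor.extract_saliency_maps(batch.to('cuda'))
```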
Methods
- create_model(model_type)
- extract_descriptors(batch[, layer, facet, ...]) - Extract descriptors from the model.
- extract_saliency_maps(batch) - Extract saliency maps.
- patch_vit_resolution(model, stride) - Change the resolution of the model output by changing the stride of the patch extraction.
- preprocess(image_path[, load_size, patch_size]) - Preprocess an image file before extraction.
- preprocess_pil(pil_image) - Preprocess a PIL image before extraction.
- __init__(model_type='dino_vits8', stride=4, model=None, device='cuda')
- Parameters:
model_type (str) – A string specifying the type of model to extract from. One of [dino_vits8 | dino_vits16 | dino_vitb8 | dino_vitb16 | vit_small_patch8_224 | vit_small_patch16_224 | vit_base_patch8_224 | vit_base_patch16_224].
stride (int) – Stride of the first convolution (patch-embedding) layer. A smaller stride yields a higher-resolution feature map.
model (Module) – Optional. An nn.Module to extract from instead of creating a new one inside ViTExtractor; should be compatible with model_type.
device (str) – Torch device to run the model on, e.g. 'cuda' (the default) or 'cpu'.
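Hedged construction examples (the device choices are assumptions; the model names come from the list above):

```python
from concepts.vision.fm_match.dino.extractor_dino import ViTExtractor

# DINO ViT-S/8 with stride 4: patches of size 8 sampled every 4 pixels,
# giving an overlapping, denser descriptor grid.
extractor_dense = ViTExtractor(model_type='dino_vits8', stride=4, device='cuda')

# A timm ViT-B/16 backbone kept at its native stride (no overlap between patches).
extractor_b16 = ViTExtractor(model_type='vit_base_patch16_224', stride=16, device='cpu')
```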
- __new__(**kwargs)
- extract_descriptors(batch, layer=11, facet='key', bin=False, include_cls=False)
Extract descriptors from the model.
- Parameters:
batch – Batch to extract descriptors for. Has shape BxCxHxW.
layer – Layer to extract from. A number between 0 and 11.
facet – Facet to extract. One of the following options: ['key' | 'query' | 'value' | 'token'].
bin – Apply log binning to the descriptor. Default is False.
include_cls – Whether to include the CLS token. Default is False.
- Returns:
Tensor of descriptors of shape Bx1xtxd', where d' is the dimension of the descriptors.
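A sketch of typical calls, continuing from the construction sketch above (the layer and facet choices here are illustrative):

```python
import torch

with torch.no_grad():
    # 'key' facet of the last transformer block, without log binning.
    key_desc = extractor.extract_descriptors(batch.to('cuda'), layer=11, facet='key', bin=False)
    # Raw token embeddings from an earlier layer, for comparison.
    tok_desc = extractor.extract_descriptors(batch.to('cuda'), layer=9, facet='token')

print(key_desc.shape)  # Bx1xtxd', as described in the Returns note above
```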
- extract_saliency_maps(batch)
Extract saliency maps. The saliency maps are obtained by averaging several attention heads of the CLS token from the last layer. All values are then normalized to the range [0, 1].
- Parameters:
batch – Batch to extract saliency maps for. Has shape BxCxHxW.
- Returns:
A tensor of saliency maps of shape Bx(t-1).
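A sketch of turning the flat Bx(t-1) saliency vector into a 2D map, continuing from the sketch above (the square-grid assumption only holds for square inputs):

```python
import torch

with torch.no_grad():
    saliency = extractor.extract_saliency_maps(batch.to('cuda'))  # Bx(t-1), values in [0, 1]

# Assumes a square patch grid; for non-square inputs use the actual grid dimensions.
grid = int(saliency.shape[-1] ** 0.5)
saliency_map = saliency.reshape(-1, grid, grid)
```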
- static patch_vit_resolution(model, stride)
Change the resolution of the model output by changing the stride of the patch extraction.
- Parameters:
model – The model to change the resolution of.
stride – The new stride parameter.
- Returns:
The adjusted model.
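A sketch of how the stride affects the token grid: for an input of height H, width W, patch size p, and stride s, each spatial dimension yields roughly 1 + (H - p) / s patches. The torch.hub call below is one way to obtain a compatible DINO backbone and is an assumption, not part of this API:

```python
import torch
from concepts.vision.fm_match.dino.extractor_dino import ViTExtractor

# Load a DINO ViT-S/8 backbone (patch size 8) and halve its patch-extraction stride.
dino = torch.hub.load('facebookresearch/dino:main', 'dino_vits8')
dense_dino = ViTExtractor.patch_vit_resolution(dino, stride=4)

# For a 224x224 input with p = 8: stride 8 gives a 28x28 patch grid,
# stride 4 gives a 55x55 grid (1 + (224 - 8) / s patches per side).
```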
- preprocess(image_path, load_size=None, patch_size=14)
Preprocess an image before extraction.
- Parameters:
image_path – Path to the image to be extracted.
load_size – Optional. Size to resize the image to before the rest of the preprocessing.
patch_size – Optional. Default is 14.
- Returns:
A tuple containing: (1) the preprocessed image as a tensor of shape BxCxHxW to feed into the model; (2) the PIL image in the corresponding dimensions.
- preprocess_pil(pil_image)
Preprocess a PIL image before extraction.
- Parameters:
pil_image – The PIL image to be extracted.
- Returns:
A tuple containing: (1) the preprocessed image as a tensor of shape BxCxHxW to feed into the model; (2) the PIL image in the corresponding dimensions.
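A sketch of preprocessing an in-memory PIL image, continuing from the construction sketch above (the solid-colour test image is an assumption for illustration):

```python
from PIL import Image

pil = Image.new('RGB', (224, 224))
batch, resized_pil = extractor.preprocess_pil(pil)
descriptors = extractor.extract_descriptors(batch.to('cuda'), layer=11, facet='key')
```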