concepts.benchmark.common.vocab.gen_vocab

gen_vocab(dataset, keys=None, extra_words=None, cls=None, single_word=False)

Generate a Vocabulary instance from a dataset.

By default, this function retrieves each data item by calling the dataset's get_metainfo function. If the dataset does not provide get_metainfo, it falls back to dataset[i].

When keys is specified, get_metainfo should return a dictionary, and keys lists the entries of that dictionary to read. This function splits the string stored under each of these keys and adds the resulting tokens to the vocabulary. If keys is not specified, this function assumes that get_metainfo returns a string.
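The retrieval and splitting rules above can be sketched in plain Python. This is an illustration of the documented behavior, not the library's implementation: the ToyDataset class, the collect_tokens helper, and the use of whitespace splitting are all assumptions made for this example.

```python
class ToyDataset:
    """Hypothetical dataset exposing get_metainfo, as described above."""

    def __init__(self, records):
        self._records = records

    def __len__(self):
        return len(self._records)

    def get_metainfo(self, index):
        return self._records[index]


def collect_tokens(dataset, keys=None):
    """Collect the set of tokens following the documented rules (sketch only)."""
    tokens = set()
    for i in range(len(dataset)):
        # Prefer get_metainfo; fall back to plain indexing if it is missing.
        if hasattr(dataset, "get_metainfo"):
            item = dataset.get_metainfo(i)
        else:
            item = dataset[i]

        if keys is None:
            # No keys: the item itself is assumed to be a string.
            # Whitespace splitting is an assumption of this sketch.
            tokens.update(item.split())
        else:
            # Keys given: the item is assumed to be a dictionary of strings.
            for key in keys:
                tokens.update(item[key].split())
    return tokens


records = [
    {"question": "what color is the cube", "answer": "red"},
    {"question": "is the sphere red", "answer": "yes"},
]
with_keys = collect_tokens(ToyDataset(records), keys=["question", "answer"])

# A plain list has no get_metainfo, so dataset[i] is used instead.
without_keys = collect_tokens(["a red cube", "a blue sphere"])
```

With the real function, the equivalent call would presumably be gen_vocab(dataset, keys=['question', 'answer']), which returns a Vocabulary instance rather than a bare set.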

By default, this function adds four special tokens to the vocabulary: EBD_PAD, EBD_BOS, EBD_EOS, and EBD_UNK. Users can add further tokens using the extra_words argument.
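As a rough sketch of how the special tokens and extra_words might combine with the dataset tokens, consider the following. The string values given to the EBD_* names are placeholders (the library defines its own constants), and the build_word2idx helper and its index ordering are assumptions made for this example, not the library's behavior.

```python
# Placeholder values for illustration; the library defines its own EBD_* constants.
EBD_PAD, EBD_BOS, EBD_EOS, EBD_UNK = "<pad>", "<bos>", "<eos>", "<unk>"


def build_word2idx(tokens, extra_words=None):
    """Map each word to an index, special tokens first (ordering is an assumption)."""
    words = [EBD_PAD, EBD_BOS, EBD_EOS, EBD_UNK]
    words.extend(extra_words or [])
    # Sort the dataset tokens so the mapping is deterministic.
    words.extend(sorted(tokens))

    word2idx = {}
    for word in words:
        # setdefault skips duplicates, so each word keeps its first index.
        word2idx.setdefault(word, len(word2idx))
    return word2idx


word2idx = build_word2idx({"red", "cube"}, extra_words=["<mask>"])
```

Placing the special tokens first keeps their indices stable regardless of which dataset the vocabulary is built from, which is a common convention for padding and unknown-word handling.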