libcudf 24.04.00
nvtext::hashed_vocabulary Struct Reference

The vocabulary data for use with the subword_tokenize function.

#include <subword_tokenize.hpp>

Public Attributes

uint16_t first_token_id {}
 The first token id in the vocabulary.
 
uint16_t separator_token_id {}
 The separator token id in the vocabulary.
 
uint16_t unknown_token_id {}
 The unknown token id in the vocabulary.
 
uint32_t outer_hash_a {}
 The a parameter for the outer hash.
 
uint32_t outer_hash_b {}
 The b parameter for the outer hash.
 
uint16_t num_bins {}
 Number of hash bins.
 
std::unique_ptr< cudf::column > table
 
std::unique_ptr< cudf::column > bin_coefficients
 
std::unique_ptr< cudf::column > bin_offsets
 
std::unique_ptr< cudf::column > cp_metadata
 uint32 column, the code point metadata table to use for normalization
 
std::unique_ptr< cudf::column > aux_cp_table
 uint64 column, the auxiliary code point table to use for normalization
 

Detailed Description

The vocabulary data for use with the subword_tokenize function.

Definition at line 33 of file subword_tokenize.hpp.

Member Data Documentation

◆ bin_coefficients

std::unique_ptr<cudf::column> nvtext::hashed_vocabulary::bin_coefficients

uint64 column, containing the hashing parameters for each hash bin on the GPU

Definition at line 42 of file subword_tokenize.hpp.

◆ bin_offsets

std::unique_ptr<cudf::column> nvtext::hashed_vocabulary::bin_offsets

uint16 column, containing the start index of each bin in the flattened hash table

Definition at line 44 of file subword_tokenize.hpp.
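
Because bin_offsets holds the start index of each bin in the flattened table, one plausible way to picture it is as a running total of the bin sizes. The sketch below is a host-side illustration only. The helper name make_bin_offsets and the idea of deriving offsets from a vector of bin sizes are assumptions made for this example, not part of the libcudf API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a libcudf function): compute the start index of
// each bin in a flattened table, given the number of slots in each bin.
std::vector<std::uint16_t> make_bin_offsets(std::vector<std::uint16_t> const& bin_sizes)
{
  std::vector<std::uint16_t> offsets;
  offsets.reserve(bin_sizes.size());
  std::uint32_t running = 0;  // wider type so the running total cannot wrap
  for (auto const size : bin_sizes) {
    // Every start index has to fit the uint16 element type of the column.
    assert(running <= UINT16_MAX);
    offsets.push_back(static_cast<std::uint16_t>(running));
    running += size;
  }
  return offsets;
}
```

With bin sizes 3, 0, 2 and 4, the bins start at slots 0, 3, 3 and 5. An empty bin shares its start index with the bin that follows it.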

◆ table

std::unique_ptr<cudf::column> nvtext::hashed_vocabulary::table

uint64 column, the flattened hash table with key, value pairs packed in 64-bits

Definition at line 40 of file subword_tokenize.hpp.
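
Taken together, the outer hash parameters, num_bins, bin_coefficients, bin_offsets and table describe a two-level hash lookup: the outer hash selects a bin, the bin's own coefficients select a slot inside it, and the slot's stored key is compared with the query. The following host-side sketch shows that flow under stated assumptions. The bit layouts (key in the upper 48 bits and token id in the lower 16 bits of a table entry; a, b and bin size packed into a coefficient), the prime modulus, and every name here (toy_vocabulary, toy_hash, lookup and the pack helpers) are hypothetical choices for illustration. The actual layout is defined by the libcudf implementation and its vocabulary file, and the real members are cudf::column objects in device memory.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Host-side stand-in for the lookup-related members of hashed_vocabulary.
struct toy_vocabulary {
  std::uint32_t outer_hash_a{};
  std::uint32_t outer_hash_b{};
  std::uint16_t num_bins{};
  std::vector<std::uint64_t> table;             // packed key/value slots
  std::vector<std::uint64_t> bin_coefficients;  // packed per-bin hash parameters
  std::vector<std::uint16_t> bin_offsets;       // start slot of each bin
};

// ASSUMPTION: an arbitrary prime modulus chosen for this sketch.
constexpr std::uint64_t toy_prime = 2147483647ULL;

// ASSUMPTION: key in the upper 48 bits, token id in the lower 16 bits.
std::uint64_t pack_entry(std::uint64_t key, std::uint16_t token_id)
{
  assert(key < (1ULL << 48) - 1);  // the all-ones key is reserved for empty slots
  return (key << 16) | token_id;
}

// ASSUMPTION: a in bits 16 and up, b in bits 8-15, bin size in bits 0-7.
std::uint64_t pack_coefficients(std::uint64_t a, std::uint8_t b, std::uint8_t size)
{
  assert(a < (1ULL << 48));
  return (a << 16) | (std::uint64_t{b} << 8) | size;
}

// Hash of the form ((a * key + b) mod prime) mod size.
std::uint64_t toy_hash(std::uint64_t key, std::uint64_t a, std::uint64_t b, std::uint64_t size)
{
  return ((a * key + b) % toy_prime) % size;
}

// Two-level lookup: outer hash picks the bin, inner hash picks the slot.
std::optional<std::uint16_t> lookup(toy_vocabulary const& v, std::uint64_t key)
{
  if (v.num_bins == 0) return std::nullopt;
  auto const bin    = toy_hash(key, v.outer_hash_a, v.outer_hash_b, v.num_bins);
  auto const coeffs = v.bin_coefficients[bin];
  auto const size   = coeffs & 0xFF;
  if (size == 0) return std::nullopt;  // empty bin
  auto const slot  = v.bin_offsets[bin] + toy_hash(key, coeffs >> 16, (coeffs >> 8) & 0xFF, size);
  auto const entry = v.table[slot];
  if ((entry >> 16) != key) return std::nullopt;  // slot belongs to another key or is empty
  return static_cast<std::uint16_t>(entry & 0xFFFF);
}

// Hand-built example: even keys land in bin 0 (3 slots), odd keys in bin 1 (1 slot).
toy_vocabulary make_toy_vocabulary()
{
  toy_vocabulary v;
  v.outer_hash_a = 1;
  v.outer_hash_b = 0;
  v.num_bins     = 2;
  std::uint64_t const empty = ~std::uint64_t{0};  // sentinel for an unused slot
  v.table            = {empty, pack_entry(10, 100), pack_entry(20, 200), pack_entry(7, 70)};
  v.bin_coefficients = {pack_coefficients(1, 0, 3), pack_coefficients(1, 0, 1)};
  v.bin_offsets      = {0, 3};
  return v;
}
```

In this toy table, keys 10, 20 and 7 resolve to token ids 100, 200 and 70, while any other key returns an empty optional. A caller could map that case to unknown_token_id with value_or.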


The documentation for this struct was generated from the following file: