libcudf  24.02.00
nvtext Namespace Reference

NVText APIs.

Classes

struct  bpe_merge_pairs
 The table of merge pairs for the BPE encoder.
 
struct  hashed_vocabulary
 The vocabulary data for use with the subword_tokenize function.
 
struct  tokenizer_result
 Result object for the subword_tokenize functions.
 
struct  tokenize_vocabulary
 Vocabulary object to be used with nvtext::tokenize_with_vocabulary.
 

Enumerations

enum class letter_type { CONSONANT, VOWEL }
 Used for specifying letter type to check.
 

Functions

std::unique_ptr< bpe_merge_pairs > load_merge_pairs (cudf::strings_column_view const &merge_pairs, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Create an nvtext::bpe_merge_pairs from a strings column.
 
std::unique_ptr< cudf::column > byte_pair_encoding (cudf::strings_column_view const &input, bpe_merge_pairs const &merges_pairs, cudf::string_scalar const &separator=cudf::string_scalar(" "), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Byte pair encode the input strings.
 
std::unique_ptr< cudf::column > edit_distance (cudf::strings_column_view const &input, cudf::strings_column_view const &targets, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Compute the edit distance between individual strings in two strings columns.
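
To illustrate what a row-wise edit distance computes, the following is a minimal host-side sketch in plain C++, not libcudf's GPU implementation. It assumes the standard Levenshtein distance with unit costs and compares bytes rather than UTF-8 characters. The names `levenshtein` and `edit_distance_rows` are hypothetical helpers introduced for this example, and the sketch requires both inputs to have the same number of rows.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Levenshtein distance between two byte strings (unit cost for insert,
// delete, and substitute), using a single rolling row of the DP table.
int levenshtein(std::string const& a, std::string const& b)
{
  std::vector<int> row(b.size() + 1);
  for (std::size_t j = 0; j <= b.size(); ++j) { row[j] = static_cast<int>(j); }
  for (std::size_t i = 1; i <= a.size(); ++i) {
    int diag = row[0];  // d[i-1][j-1]
    row[0]   = static_cast<int>(i);
    for (std::size_t j = 1; j <= b.size(); ++j) {
      int const up   = row[j];  // d[i-1][j]
      int const cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
      row[j]         = std::min({row[j - 1] + 1, up + 1, diag + cost});
      diag           = up;
    }
  }
  return row[b.size()];
}

// Row-wise distances: out[i] = levenshtein(input[i], targets[i]).
std::vector<int> edit_distance_rows(std::vector<std::string> const& input,
                                    std::vector<std::string> const& targets)
{
  assert(input.size() == targets.size());  // sketch assumption: equal row counts
  std::vector<int> out;
  out.reserve(input.size());
  for (std::size_t i = 0; i < input.size(); ++i) {
    out.push_back(levenshtein(input[i], targets[i]));
  }
  return out;
}
```

Calling `levenshtein` on every pair of rows of a single column gives the square matrix that an all-pairs variant such as edit_distance_matrix describes.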
 
std::unique_ptr< cudf::column > edit_distance_matrix (cudf::strings_column_view const &input, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Compute the edit distance between all the strings in the input column.
 
std::unique_ptr< cudf::column > generate_ngrams (cudf::strings_column_view const &input, cudf::size_type ngrams, cudf::string_scalar const &separator, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Returns a single column of strings by generating ngrams from a strings column.
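
As an illustration of row-level ngrams, here is a small host-side sketch in plain C++. It treats each row as one token and joins every window of `n` consecutive rows with the separator. The helper name `row_ngrams` is hypothetical, and skipping empty rows is an assumption of this sketch rather than a statement about the library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Join each window of n consecutive non-empty rows with `sep`.
std::vector<std::string> row_ngrams(std::vector<std::string> const& rows,
                                    std::size_t n,
                                    std::string const& sep)
{
  assert(n >= 1);
  // Sketch assumption: empty rows do not take part in any ngram.
  std::vector<std::string> tokens;
  for (auto const& r : rows) {
    if (!r.empty()) { tokens.push_back(r); }
  }
  std::vector<std::string> out;
  for (std::size_t i = 0; i + n <= tokens.size(); ++i) {
    std::string gram = tokens[i];
    for (std::size_t k = 1; k < n; ++k) {
      gram += sep;
      gram += tokens[i + k];
    }
    out.push_back(gram);
  }
  return out;
}
```

With rows `["a", "bb", "ccc"]` and separator `"_"`, this yields `["a_bb", "bb_ccc"]` for `n = 2` and `["a_bb_ccc"]` for `n = 3`.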
 
std::unique_ptr< cudf::column > generate_character_ngrams (cudf::strings_column_view const &input, cudf::size_type ngrams=2, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Generates ngrams of characters within each string.
 
std::unique_ptr< cudf::column > hash_character_ngrams (cudf::strings_column_view const &input, cudf::size_type ngrams=5, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Hashes ngrams of characters within each string.
 
std::unique_ptr< cudf::column > jaccard_index (cudf::strings_column_view const &input1, cudf::strings_column_view const &input2, cudf::size_type width, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Computes the Jaccard similarity between individual rows in two strings columns.
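
The Jaccard similarity of two sets A and B is |A ∩ B| / |A ∪ B|. The sketch below, in plain host-side C++, applies that formula to the sets of `width`-byte substrings of two strings. It is an illustration only: the helper names `char_ngrams` and `jaccard` are hypothetical, the sketch stores substrings directly instead of hashing them, and returning 0 when both sets are empty is a choice made here, not documented library behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Set of all substrings of exactly `width` bytes.
std::set<std::string> char_ngrams(std::string const& s, std::size_t width)
{
  std::set<std::string> grams;
  for (std::size_t i = 0; i + width <= s.size(); ++i) {
    grams.insert(s.substr(i, width));
  }
  return grams;
}

// Jaccard index |A n B| / |A u B| over the two ngram sets.
float jaccard(std::string const& a, std::string const& b, std::size_t width)
{
  assert(width >= 1);
  auto const sa = char_ngrams(a, width);
  auto const sb = char_ngrams(b, width);
  std::size_t common = 0;
  for (auto const& g : sa) { common += sb.count(g); }
  std::size_t const uni = sa.size() + sb.size() - common;
  // Sketch choice: two empty sets are given similarity 0.
  return uni == 0 ? 0.0f : static_cast<float>(common) / static_cast<float>(uni);
}
```

For example, `"abcd"` and `"bcde"` with `width = 2` share the ngrams `bc` and `cd` out of four distinct ngrams, so the index is 0.5.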
 
std::unique_ptr< cudf::column > minhash (cudf::strings_column_view const &input, cudf::numeric_scalar< uint32_t > seed=0, cudf::size_type width=4, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Returns the minhash value for each string.
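
A minhash value is the minimum hash over all fixed-width substrings of a string. The host-side sketch below shows the idea in plain C++. It deliberately swaps in a seeded FNV-1a hash as a stand-in, so its values will not match the library's output. The helper names `fnv1a` and `minhash_sketch`, and the choice to hash a short string as a whole, are assumptions of this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>

// Seeded 32-bit FNV-1a, used here only as a stand-in hash function.
std::uint32_t fnv1a(std::string const& s, std::uint32_t seed)
{
  std::uint32_t h = 2166136261u ^ seed;
  for (unsigned char c : s) {
    h ^= c;
    h *= 16777619u;
  }
  return h;
}

// Minimum hash over every substring of `width` bytes.
std::uint32_t minhash_sketch(std::string const& s,
                             std::uint32_t seed = 0,
                             std::size_t width  = 4)
{
  assert(width >= 1);
  // Sketch choice: a string no longer than `width` is hashed as a whole.
  if (s.size() <= width) { return fnv1a(s, seed); }
  std::uint32_t best = std::numeric_limits<std::uint32_t>::max();
  for (std::size_t i = 0; i + width <= s.size(); ++i) {
    best = std::min(best, fnv1a(s.substr(i, width), seed));
  }
  return best;
}
```

Calling `minhash_sketch` once per seed for the same string gives the per-seed signature that the multi-seed overloads describe.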
 
std::unique_ptr< cudf::column > minhash (cudf::strings_column_view const &input, cudf::device_span< uint32_t const > seeds, cudf::size_type width=4, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Returns the minhash values for each string per seed.
 
std::unique_ptr< cudf::column > minhash64 (cudf::strings_column_view const &input, cudf::numeric_scalar< uint64_t > seed=0, cudf::size_type width=4, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Returns the minhash value for each string.
 
std::unique_ptr< cudf::column > minhash64 (cudf::strings_column_view const &input, cudf::device_span< uint64_t const > seeds, cudf::size_type width=4, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Returns the minhash values for each string per seed.
 
std::unique_ptr< cudf::column > ngrams_tokenize (cudf::strings_column_view const &input, cudf::size_type ngrams, cudf::string_scalar const &delimiter, cudf::string_scalar const &separator, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Returns a single column of strings by tokenizing the input strings column and then producing ngrams of each string.
 
std::unique_ptr< cudf::column > normalize_spaces (cudf::strings_column_view const &input, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Returns a new strings column by normalizing the whitespace in each string in the input column.
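
Whitespace normalization can be pictured with the following host-side sketch in plain C++. It assumes that normalizing means collapsing each run of whitespace into one space and dropping leading and trailing whitespace, and it classifies bytes with `std::isspace`, so it only covers ASCII whitespace. The helper name `normalize_spaces_sketch` is hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Collapse runs of whitespace to a single space and trim both ends.
std::string normalize_spaces_sketch(std::string const& s)
{
  std::string out;
  bool pending_space = false;  // a separator is owed before the next token
  for (unsigned char c : s) {
    if (std::isspace(c)) {
      pending_space = !out.empty();  // ignore leading whitespace
    } else {
      if (pending_space) { out.push_back(' '); }
      pending_space = false;
      out.push_back(static_cast<char>(c));
    }
  }
  // Postcondition: the result never starts or ends with a space.
  assert(out.empty() || (out.front() != ' ' && out.back() != ' '));
  return out;
}
```

Under these assumptions, `"  c  d\n"` becomes `"c d"` and `"e \t f "` becomes `"e f"`.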
 
std::unique_ptr< cudf::column > normalize_characters (cudf::strings_column_view const &input, bool do_lower_case, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Normalizes strings characters for tokenizing.
 
std::unique_ptr< cudf::column > replace_tokens (cudf::strings_column_view const &input, cudf::strings_column_view const &targets, cudf::strings_column_view const &replacements, cudf::string_scalar const &delimiter=cudf::string_scalar{""}, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Replaces specified tokens with corresponding replacement strings.
 
std::unique_ptr< cudf::column > filter_tokens (cudf::strings_column_view const &input, cudf::size_type min_token_length, cudf::string_scalar const &replacement=cudf::string_scalar{""}, cudf::string_scalar const &delimiter=cudf::string_scalar{""}, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Removes tokens whose lengths are less than a specified number of characters.
 
std::unique_ptr< cudf::column > is_letter (cudf::strings_column_view const &input, letter_type ltype, cudf::size_type character_index, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Returns a boolean column indicating whether the character at character_index in each input string is a consonant or vowel.
 
std::unique_ptr< cudf::column > is_letter (cudf::strings_column_view const &input, letter_type ltype, cudf::column_view const &indices, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Returns a boolean column indicating whether the character at indices[i] of input[i] is a consonant or vowel.
 
std::unique_ptr< cudf::column > porter_stemmer_measure (cudf::strings_column_view const &input, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Returns the Porter Stemmer measurements of a strings column.
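
In the Porter stemming algorithm, a word has the form [C](VC)^m[V], where C and V are runs of consonants and vowels and m is the measure. The host-side C++ sketch below computes m under the usual Porter rules: a, e, i, o, u are vowels, and y is a vowel only when preceded by a consonant. It assumes a single lower-case ASCII word per string. The helper names `consonant_mask` and `porter_measure` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// mask[i] is true when w[i] is a consonant under the Porter rules.
std::vector<bool> consonant_mask(std::string const& w)
{
  std::vector<bool> mask(w.size());
  for (std::size_t i = 0; i < w.size(); ++i) {
    char const c = w[i];
    assert(c >= 'a' && c <= 'z');  // sketch assumption: lower-case ASCII word
    if (c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u') {
      mask[i] = false;
    } else if (c == 'y') {
      // y is a vowel only when the previous letter is a consonant.
      mask[i] = (i == 0) || !mask[i - 1];
    } else {
      mask[i] = true;
    }
  }
  return mask;
}

// Measure m: the number of vowel-run to consonant-run transitions.
int porter_measure(std::string const& w)
{
  auto const mask = consonant_mask(w);
  int m           = 0;
  bool prev_vowel = false;
  for (std::size_t i = 0; i < mask.size(); ++i) {
    if (mask[i] && prev_vowel) { ++m; }
    prev_vowel = !mask[i];
  }
  return m;
}
```

With these rules, `"tree"` has measure 0, `"trouble"` has measure 1, and `"troubles"` has measure 2. The same mask answers the consonant or vowel question that is_letter poses for a given character index.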
 
std::unique_ptr< hashed_vocabulary > load_vocabulary_file (std::string const &filename_hashed_vocabulary, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Load the hashed vocabulary file into device memory.
 
tokenizer_result subword_tokenize (cudf::strings_column_view const &strings, hashed_vocabulary const &vocabulary_table, uint32_t max_sequence_length, uint32_t stride, bool do_lower_case, bool do_truncate, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Creates a tokenizer that cleans the text, splits it into tokens and returns token-ids from an input vocabulary.
 
std::unique_ptr< cudf::column > tokenize (cudf::strings_column_view const &input, cudf::string_scalar const &delimiter=cudf::string_scalar{""}, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Returns a single column of strings by tokenizing the input strings column using the provided characters as delimiters.
 
std::unique_ptr< cudf::column > tokenize (cudf::strings_column_view const &input, cudf::strings_column_view const &delimiters, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Returns a single column of strings by tokenizing the input strings column using multiple strings as delimiters.
 
std::unique_ptr< cudf::column > count_tokens (cudf::strings_column_view const &input, cudf::string_scalar const &delimiter=cudf::string_scalar{""}, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Returns the number of tokens in each string of a strings column.
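
The relationship between tokenizing and counting tokens can be sketched on the host in plain C++. The example assumes tokens are separated by ASCII whitespace, which is how this sketch interprets the empty default delimiter. It does not model custom delimiters. The helper names `whitespace_tokens`, `tokenize_rows`, and `count_tokens_rows` are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Split one string into whitespace-delimited tokens.
std::vector<std::string> whitespace_tokens(std::string const& s)
{
  std::vector<std::string> tokens;
  std::string current;
  for (unsigned char c : s) {
    if (std::isspace(c)) {
      if (!current.empty()) {
        tokens.push_back(current);
        current.clear();
      }
    } else {
      current.push_back(static_cast<char>(c));
    }
  }
  if (!current.empty()) { tokens.push_back(current); }
  return tokens;
}

// Flatten the tokens of all rows into one output column.
std::vector<std::string> tokenize_rows(std::vector<std::string> const& rows)
{
  std::vector<std::string> flat;
  for (auto const& row : rows) {
    for (auto const& tok : whitespace_tokens(row)) {
      assert(!tok.empty());  // tokens are never empty by construction
      flat.push_back(tok);
    }
  }
  return flat;
}

// Number of tokens in each row; the output has one entry per input row.
std::vector<int> count_tokens_rows(std::vector<std::string> const& rows)
{
  std::vector<int> counts;
  for (auto const& row : rows) {
    counts.push_back(static_cast<int>(whitespace_tokens(row).size()));
  }
  return counts;
}
```

For rows `["a", "b c", "d  e f "]`, the flattened tokens are `["a", "b", "c", "d", "e", "f"]` and the per-row counts are `[1, 2, 3]`. The counts sum to the length of the flattened column.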
 
std::unique_ptr< cudf::column > count_tokens (cudf::strings_column_view const &input, cudf::strings_column_view const &delimiters, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Returns the number of tokens in each string of a strings column by using multiple strings delimiters to identify tokens in each string.
 
std::unique_ptr< cudf::column > character_tokenize (cudf::strings_column_view const &input, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Returns a single column of strings by converting each character to a string.
 
std::unique_ptr< cudf::column > detokenize (cudf::strings_column_view const &input, cudf::column_view const &row_indices, cudf::string_scalar const &separator=cudf::string_scalar(" "), rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Creates a strings column from a strings column of tokens and an associated column of row ids.
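
Detokenizing is the inverse of tokenizing: tokens that share a row id are joined back into one string. The following host-side C++ sketch groups tokens by row id, keeps their original relative order, and joins each group with the separator. The helper name `detokenize_sketch` is hypothetical, and emitting only the row ids that actually occur (in ascending order) is an assumption of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Join tokens that share a row id, ordered by ascending row id.
std::vector<std::string> detokenize_sketch(std::vector<std::string> const& tokens,
                                           std::vector<int> const& row_indices,
                                           std::string const& sep = " ")
{
  assert(tokens.size() == row_indices.size());  // one row id per token
  std::map<int, std::vector<std::string>> grouped;
  for (std::size_t i = 0; i < tokens.size(); ++i) {
    grouped[row_indices[i]].push_back(tokens[i]);
  }
  std::vector<std::string> out;
  for (auto const& entry : grouped) {
    std::string joined;
    for (std::size_t k = 0; k < entry.second.size(); ++k) {
      if (k > 0) { joined += sep; }
      joined += entry.second[k];
    }
    out.push_back(joined);
  }
  return out;
}
```

With tokens `["hello", "world", "one", "two", "three"]`, the row ids `[0, 0, 1, 1, 1]` give `["hello world", "one two three"]`, while `[0, 2, 1, 1, 0]` give `["hello three", "one two", "world"]`.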
 
std::unique_ptr< tokenize_vocabulary > load_vocabulary (cudf::strings_column_view const &input, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Create a tokenize_vocabulary object from a strings column.
 
std::unique_ptr< cudf::column > tokenize_with_vocabulary (cudf::strings_column_view const &input, tokenize_vocabulary const &vocabulary, cudf::string_scalar const &delimiter, cudf::size_type default_id=-1, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Returns the token ids for the input string by looking up each delimited token in the given vocabulary.
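
The vocabulary lookup can be illustrated with a host-side sketch in plain C++. It assumes that a token's id is its row position in the vocabulary column and that tokens missing from the vocabulary map to `default_id`. The single-character delimiter and the helper names `build_vocabulary` and `token_ids` are simplifications introduced for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

using vocab_map = std::unordered_map<std::string, int>;

// Map each vocabulary entry to its row position (assumed to be its id).
vocab_map build_vocabulary(std::vector<std::string> const& entries)
{
  assert(!entries.empty());  // sketch treats an empty vocabulary as an error
  vocab_map vocab;
  for (std::size_t i = 0; i < entries.size(); ++i) {
    vocab.emplace(entries[i], static_cast<int>(i));  // first occurrence wins
  }
  return vocab;
}

// Split `row` on `delimiter` and look up each non-empty token.
std::vector<int> token_ids(std::string const& row,
                           vocab_map const& vocab,
                           char delimiter,
                           int default_id = -1)
{
  std::vector<int> ids;
  std::size_t begin = 0;
  while (begin <= row.size()) {
    std::size_t end = row.find(delimiter, begin);
    if (end == std::string::npos) { end = row.size(); }
    if (end > begin) {
      auto const it = vocab.find(row.substr(begin, end - begin));
      ids.push_back(it == vocab.end() ? default_id : it->second);
    }
    begin = end + 1;
  }
  return ids;
}
```

With the vocabulary `["brown", "fox", "quick", "the"]`, the row `"the quick brown fox jumped"` maps to `[3, 2, 0, 1, -1]`, where `-1` marks the token that is not in the vocabulary.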
 
