libcudf  24.04.00
NGrams

Files

file  generate_ngrams.hpp
 
file  ngrams_tokenize.hpp
 

Functions

std::unique_ptr< cudf::column >  nvtext::generate_ngrams (cudf::strings_column_view const &input, cudf::size_type ngrams, cudf::string_scalar const &separator, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Returns a single column of strings by generating ngrams from a strings column.
 
std::unique_ptr< cudf::column >  nvtext::generate_character_ngrams (cudf::strings_column_view const &input, cudf::size_type ngrams=2, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Generates ngrams of characters within each string.
 
std::unique_ptr< cudf::column >  nvtext::hash_character_ngrams (cudf::strings_column_view const &input, cudf::size_type ngrams=5, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Hashes ngrams of characters within each string.
 
std::unique_ptr< cudf::column >  nvtext::ngrams_tokenize (cudf::strings_column_view const &input, cudf::size_type ngrams, cudf::string_scalar const &delimiter, cudf::string_scalar const &separator, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Returns a single column of strings by tokenizing the input strings column and then producing ngrams of each string.
 

Function Documentation

◆ generate_character_ngrams()

std::unique_ptr<cudf::column> nvtext::generate_character_ngrams ( cudf::strings_column_view const &  input,
cudf::size_type  ngrams = 2,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::mr::device_memory_resource *  mr = rmm::mr::get_current_device_resource() 
)

Generates ngrams of characters within each string.

Each character of a string is used to build ngrams. Ngrams are not created across strings.

["ab", "cde", "fgh"] would generate bigrams as ["ab", "cd", "de", "fg", "gh"]

The size of the output column will be the total number of ngrams generated from the input strings column.

All null row entries are ignored and the output contains all valid rows.

Exceptions
cudf::logic_error  if ngrams < 2
cudf::logic_error  if there are not enough characters to generate any ngrams
Parameters
input  Strings column to produce ngrams from
ngrams  The ngram size to generate, in characters. Default is 2 (bigram).
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource used to allocate the returned column's device memory
Returns
New strings column of ngrams
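
The behavior described above can be sketched on the host in standard C++. This is an illustration of the documented semantics only, not libcudf code: character_ngrams_reference is a hypothetical helper, null rows are modeled with std::optional, std::logic_error stands in for cudf::logic_error, and characters are assumed to be single-byte ASCII, whereas the library operates on UTF-8 strings in device memory.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical host-side reference for the documented behavior of
// nvtext::generate_character_ngrams. Assumes ASCII input (1 byte = 1 character).
std::vector<std::string> character_ngrams_reference(
  std::vector<std::optional<std::string>> const& input, std::size_t ngrams = 2)
{
  if (ngrams < 2) { throw std::logic_error("ngrams must be at least 2"); }

  std::vector<std::string> output;
  for (auto const& row : input) {
    if (!row.has_value()) { continue; }  // null rows are ignored
    std::string const& s = *row;
    // Slide a window of `ngrams` characters over this row only;
    // the window never continues into the next row.
    for (std::size_t pos = 0; pos + ngrams <= s.size(); ++pos) {
      output.push_back(s.substr(pos, ngrams));
    }
  }

  // Mirrors the documented error when no row is long enough.
  if (output.empty()) { throw std::logic_error("not enough characters to generate any ngrams"); }
  return output;
}
```

Called with ["ab", "cde", "fgh"] and ngrams = 2, the sketch returns ["ab", "cd", "de", "fg", "gh"], matching the example above.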

◆ generate_ngrams()

std::unique_ptr<cudf::column> nvtext::generate_ngrams ( cudf::strings_column_view const &  input,
cudf::size_type  ngrams,
cudf::string_scalar const &  separator,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::mr::device_memory_resource *  mr = rmm::mr::get_current_device_resource() 
)

Returns a single column of strings by generating ngrams from a strings column.

An ngram is a grouping of 2 or more strings with a separator. For example, generating bigrams groups all adjacent pairs of strings.

["a", "bb", "ccc"] would generate bigrams as ["a_bb", "bb_ccc"]
and trigrams as ["a_bb_ccc"]

The size of the output column will be the total number of ngrams generated from the input strings column.

All null row entries are ignored and the output contains all valid rows.

Exceptions
cudf::logic_error  if ngrams < 2
cudf::logic_error  if separator is invalid
cudf::logic_error  if there are not enough strings to generate any ngrams
Parameters
input  Strings column to produce ngrams from
ngrams  The ngram size to generate, in strings per ngram
separator  The string to use for separating ngram tokens
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource used to allocate the returned column's device memory
Returns
New strings column of ngrams
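
As a rough host-side illustration of the semantics above, each valid row can be treated as one token and adjacent tokens joined with the separator. The helper generate_ngrams_reference is hypothetical and is not part of libcudf; null rows are modeled with std::optional, std::logic_error stands in for cudf::logic_error, and only the behavior documented in this section is reproduced (the separator validity check is not modeled).

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical host-side reference for the documented behavior of
// nvtext::generate_ngrams: every valid row is one token.
std::vector<std::string> generate_ngrams_reference(
  std::vector<std::optional<std::string>> const& input,
  std::size_t ngrams,
  std::string const& separator)
{
  if (ngrams < 2) { throw std::logic_error("ngrams must be at least 2"); }

  // Drop null rows first so that ngrams are built from valid rows only.
  std::vector<std::string> valid;
  for (auto const& row : input) {
    if (row.has_value()) { valid.push_back(*row); }
  }
  if (valid.size() < ngrams) {
    throw std::logic_error("not enough strings to generate any ngrams");
  }

  // Join each run of `ngrams` adjacent strings with the separator.
  std::vector<std::string> output;
  for (std::size_t first = 0; first + ngrams <= valid.size(); ++first) {
    std::string ngram = valid[first];
    for (std::size_t k = 1; k < ngrams; ++k) {
      ngram += separator;
      ngram += valid[first + k];
    }
    output.push_back(std::move(ngram));
  }
  return output;
}
```

With input ["a", "bb", "ccc"] and separator "_", the sketch returns ["a_bb", "bb_ccc"] for ngrams = 2 and ["a_bb_ccc"] for ngrams = 3.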

◆ hash_character_ngrams()

std::unique_ptr<cudf::column> nvtext::hash_character_ngrams ( cudf::strings_column_view const &  input,
cudf::size_type  ngrams = 5,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::mr::device_memory_resource *  mr = rmm::mr::get_current_device_resource() 
)

Hashes ngrams of characters within each string.

Each character of a string is used to build the ngrams, and ngrams are not produced across adjacent rows.

"abcdefg" with ngrams=5 would generate ["abcde", "bcdef", "cdefg"]

The ngrams for each string are hashed and returned in a list column where the offsets specify rows of hash values for each string.

The size of the child column will be the total number of ngrams generated from the input strings column.

All null row entries are ignored and the output contains all valid rows.

The hash algorithm uses MurmurHash32 on each ngram.

Exceptions
cudf::logic_error  if ngrams < 2
cudf::logic_error  if there are not enough characters to generate any ngrams
Parameters
input  Strings column to produce ngrams from
ngrams  The ngram size to generate, in characters. Default is 5.
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource used to allocate the returned column's device memory
Returns
A lists column of hash values
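
The list layout described above (offsets delimiting each row's run of hash values in a single child column) can be sketched on the host as follows. Everything here is illustrative: HashedNgrams and hash_character_ngrams_reference are hypothetical names, the hash function is passed in by the caller rather than reproducing the library's MurmurHash32, input is assumed to be ASCII with no null rows, and the sketch assumes each input row yields exactly one list row (empty when the string is shorter than ngrams).

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical host-side stand-in for a lists column:
// row i owns hashes[offsets[i]] .. hashes[offsets[i + 1] - 1].
struct HashedNgrams {
  std::vector<std::int32_t> offsets;  // one more entry than there are rows
  std::vector<std::uint32_t> hashes;  // child values: one hash per ngram
};

// `hasher` is any callable taking std::string_view and returning std::uint32_t.
template <typename Hasher>
HashedNgrams hash_character_ngrams_reference(std::vector<std::string> const& input,
                                             std::size_t ngrams,
                                             Hasher hasher)
{
  if (ngrams < 2) { throw std::logic_error("ngrams must be at least 2"); }

  HashedNgrams result;
  result.offsets.push_back(0);
  for (auto const& s : input) {
    std::string_view const row{s};
    // Hash every window of `ngrams` characters within this row only.
    for (std::size_t pos = 0; pos + ngrams <= row.size(); ++pos) {
      result.hashes.push_back(hasher(row.substr(pos, ngrams)));
    }
    // Close this row: its hashes end at the current child size.
    result.offsets.push_back(static_cast<std::int32_t>(result.hashes.size()));
  }

  if (result.hashes.empty()) {
    throw std::logic_error("not enough characters to generate any ngrams");
  }
  return result;
}
```

For the rows ["abcdefg", "xy", "hello"] with ngrams = 5, the sketch produces offsets [0, 3, 3, 4]: three hashes for the first row, none for the second, and one for the third, so the child holds four values in total.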

◆ ngrams_tokenize()

std::unique_ptr<cudf::column> nvtext::ngrams_tokenize ( cudf::strings_column_view const &  input,
cudf::size_type  ngrams,
cudf::string_scalar const &  delimiter,
cudf::string_scalar const &  separator,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::mr::device_memory_resource *  mr = rmm::mr::get_current_device_resource() 
)

Returns a single column of strings by tokenizing the input strings column and then producing ngrams of each string.

An ngram is a grouping of 2 or more tokens with a separator. For example, generating bigrams groups all adjacent pairs of tokens for a string.

["a bb ccc"] can be tokenized to ["a", "bb", "ccc"]
bigrams would generate ["a_bb", "bb_ccc"] and trigrams would generate ["a_bb_ccc"]

The delimiter is used for tokenizing and may be zero or more characters. If the delimiter is empty, whitespace (character code-point <= ' ') is used for identifying tokens.

Once tokens are identified, ngrams are produced by joining the tokens with the specified separator. The generated ngrams use the tokens of each string only and never span strings in adjacent rows. Any input string that contains fewer tokens than the specified ngrams value is skipped and does not contribute to the output. For example, a string with a single token produces no bigrams, and a string with 2 or fewer tokens produces no trigrams.

Tokens are found by locating delimiter(s) starting at the beginning of each string. As each string is tokenized, the ngrams are generated using input column row order to build the output column. That is, ngrams created in input row[i] will be placed in the output column directly before ngrams created in input row[i+1].

The size of the output column will be the total number of ngrams generated from the input strings column.

Example:
s = ["a b c", "d e", "f g h i", "j"]
t = ngrams_tokenize(s, 2, " ", "_")
t is now ["a_b", "b_c", "d_e", "f_g", "g_h", "h_i"]

All null row entries are ignored and the output contains all valid rows.

Parameters
input  Strings column to tokenize and produce ngrams from
ngrams  The ngram size to generate, in tokens per ngram
delimiter  UTF-8 characters used to separate each string into tokens. An empty string will separate tokens using whitespace.
separator  The string to use for separating ngram tokens
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource used to allocate the returned column's device memory
Returns
New strings column of ngrams
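
The tokenize-then-join steps described above can be sketched on the host in standard C++. This is an illustration only and not libcudf code: tokenize_row and ngrams_tokenize_reference are hypothetical helpers, null rows are modeled with std::optional, and the sketch assumes that empty tokens (for example, between two consecutive delimiters) are dropped, which this section does not state. Since no exceptions are listed for this function, the sketch simply returns an empty result for ngrams = 0.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Split one row into tokens. An empty delimiter means "split on whitespace",
// where whitespace is any character with code point <= ' '.
std::vector<std::string> tokenize_row(std::string const& s, std::string const& delimiter)
{
  std::vector<std::string> tokens;
  if (delimiter.empty()) {
    std::string current;
    for (char ch : s) {
      if (static_cast<unsigned char>(ch) <= ' ') {
        if (!current.empty()) { tokens.push_back(current); current.clear(); }
      } else {
        current.push_back(ch);
      }
    }
    if (!current.empty()) { tokens.push_back(current); }
    return tokens;
  }
  std::size_t begin = 0;
  while (begin <= s.size()) {
    std::size_t end = s.find(delimiter, begin);
    if (end == std::string::npos) { end = s.size(); }
    if (end > begin) { tokens.push_back(s.substr(begin, end - begin)); }  // skip empty tokens
    begin = end + delimiter.size();
  }
  return tokens;
}

// Hypothetical host-side reference for the documented behavior of
// nvtext::ngrams_tokenize: ngrams are built per row, in input row order.
std::vector<std::string> ngrams_tokenize_reference(
  std::vector<std::optional<std::string>> const& input,
  std::size_t ngrams,
  std::string const& delimiter,
  std::string const& separator)
{
  std::vector<std::string> output;
  if (ngrams == 0) { return output; }
  for (auto const& row : input) {
    if (!row.has_value()) { continue; }  // null rows are ignored
    auto const tokens = tokenize_row(*row, delimiter);
    // Rows with fewer than `ngrams` tokens never enter this loop.
    for (std::size_t first = 0; first + ngrams <= tokens.size(); ++first) {
      std::string ngram = tokens[first];
      for (std::size_t k = 1; k < ngrams; ++k) {
        ngram += separator;
        ngram += tokens[first + k];
      }
      output.push_back(std::move(ngram));
    }
  }
  return output;
}
```

Applied to the example above (s = ["a b c", "d e", "f g h i", "j"], ngrams = 2, delimiter " ", separator "_"), the sketch returns ["a_b", "b_c", "d_e", "f_g", "g_h", "h_i"]; the row "j" has only one token and contributes nothing.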