libcudf  24.02.00
Stemming

Files

file  stemmer.hpp
 

Enumerations

enum class  nvtext::letter_type { nvtext::CONSONANT , nvtext::VOWEL }
 Used for specifying letter type to check.
 

Functions

std::unique_ptr< cudf::column > nvtext::is_letter (cudf::strings_column_view const &input, letter_type ltype, cudf::size_type character_index, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Returns boolean column indicating if character_index of the input strings is a consonant or vowel.
 
std::unique_ptr< cudf::column > nvtext::is_letter (cudf::strings_column_view const &input, letter_type ltype, cudf::column_view const &indices, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Returns boolean column indicating if character at indices[i] of input[i] is a consonant or vowel.
 
std::unique_ptr< cudf::column > nvtext::porter_stemmer_measure (cudf::strings_column_view const &input, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Returns the Porter Stemmer measurements of a strings column.
 

Detailed Description

Enumeration Type Documentation

◆ letter_type

enum class nvtext::letter_type

Used for specifying letter type to check.

Enumerator
CONSONANT 

Letter is a consonant.

VOWEL 

Letter is not a consonant.

Definition at line 32 of file stemmer.hpp.

Function Documentation

◆ is_letter() [1/2]

std::unique_ptr<cudf::column> nvtext::is_letter ( cudf::strings_column_view const &  input,
letter_type  ltype,
cudf::column_view const &  indices,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::mr::device_memory_resource *  mr = rmm::mr::get_current_device_resource() 
)

Returns a boolean column indicating whether the character at indices[i] of input[i] is of the letter type (consonant or vowel) given by ltype.

Determining consonants and vowels is described in the following paper: https://tartarus.org/martin/PorterStemmer/def.txt

Each string in the input column is expected to contain a single, lower-cased word (or subword) with no punctuation and no whitespace; otherwise, the result for that string is undefined.

Also, the algorithm only works with English words.

Example:
st = ["trouble", "toy", "sygyzy"]
ix = [3, 1, 4]
b1 = is_letter(st, VOWEL, ix)
b1 is now [true, true, false]

A negative index value will check the character starting from the end of each string. That is, for indices[i] < 0 the letter checked for string input[i] is at position input[i].length + indices[i].

Example:
st = ["trouble", "toy", "sygyzy"]
ix = [3, -2, 4] // 2nd to last character in st[1] is checked
b2 = is_letter(st, CONSONANT, ix)
b2 is now [false, false, true]

A null input element at row i produces a corresponding null entry for row i in the output column.
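The consonant/vowel rule and the index handling described above can be sketched on the host in plain C++. This is an illustrative reference only, not libcudf's GPU implementation: `is_consonant` and `is_letter_host` are hypothetical names, nulls are modeled with `std::optional`, and words are assumed to be lower-case ASCII so that byte positions equal character positions. Returning false for an out-of-range index is an assumption of this sketch; the text above does not specify that case.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

enum class letter_type { CONSONANT, VOWEL };

// Porter's definition: a, e, i, o, u are vowels; 'y' is a vowel only when
// it is preceded by a consonant; every other letter is a consonant.
bool is_consonant(std::string const& word, std::size_t pos)
{
  assert(pos < word.size());
  char const ch = word[pos];
  if (ch == 'a' || ch == 'e' || ch == 'i' || ch == 'o' || ch == 'u') return false;
  if (ch != 'y') return true;
  // 'y' at the start of a word, or after a vowel, counts as a consonant.
  return pos == 0 || !is_consonant(word, pos - 1);
}

// Hypothetical host-side analogue of the per-row overload of is_letter.
std::vector<std::optional<bool>> is_letter_host(
  std::vector<std::optional<std::string>> const& input,
  letter_type ltype,
  std::vector<std::int32_t> const& indices)
{
  if (indices.size() != input.size()) {
    throw std::invalid_argument("indices.size() != input.size()");
  }
  std::vector<std::optional<bool>> output;
  output.reserve(input.size());
  for (std::size_t i = 0; i < input.size(); ++i) {
    if (!input[i].has_value()) {  // null row in, null row out
      output.push_back(std::nullopt);
      continue;
    }
    std::string const& word = *input[i];
    auto const length = static_cast<std::int32_t>(word.size());
    // A negative index counts back from the end of the string.
    std::int32_t const pos = indices[i] < 0 ? length + indices[i] : indices[i];
    if (pos < 0 || pos >= length) {  // assumption: out of range yields false
      output.push_back(false);
      continue;
    }
    bool const consonant = is_consonant(word, static_cast<std::size_t>(pos));
    output.push_back(ltype == letter_type::CONSONANT ? consonant : !consonant);
  }
  return output;
}
```

With st = ["trouble", "toy", "sygyzy"], calling `is_letter_host(st, letter_type::VOWEL, {3, 1, 4})` gives [true, true, false], and `is_letter_host(st, letter_type::CONSONANT, {3, -2, 4})` gives [false, false, true], matching the two examples above.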

Exceptions
cudf::logic_error  if indices.size() != input.size()
cudf::logic_error  if indices contains nulls
Parameters
input  Strings column of words to measure
ltype  Specify letter type to check
indices  The character positions to check in each string
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource used to allocate the returned column's device memory
Returns
New BOOL column

◆ is_letter() [2/2]

std::unique_ptr<cudf::column> nvtext::is_letter ( cudf::strings_column_view const &  input,
letter_type  ltype,
cudf::size_type  character_index,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::mr::device_memory_resource *  mr = rmm::mr::get_current_device_resource() 
)

Returns a boolean column indicating whether the character at character_index of each input string is of the letter type (consonant or vowel) given by ltype.

Determining consonants and vowels is described in the following paper: https://tartarus.org/martin/PorterStemmer/def.txt

Each string in the input column is expected to contain a single, lower-cased word (or subword) with no punctuation and no whitespace; otherwise, the result for that string is undefined.

Also, the algorithm only works with English words.

Example:
st = ["trouble", "toy", "sygyzy"]
b1 = is_letter(st, VOWEL, 1)
b1 is now [false, true, true]

A negative index value will check the character starting from the end of each string. That is, for character_index < 0 the letter checked for string input[i] is at position input[i].length + character_index.

Example:
st = ["trouble", "toy", "sygyzy"]
b2 = is_letter(st, CONSONANT, -1) // last letter checked in each string
b2 is now [false, true, false]

A null input element at row i produces a corresponding null entry for row i in the output column.

Parameters
input  Strings column of words to measure
ltype  Specify letter type to check
character_index  The character position to check in each string
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource used to allocate the returned column's device memory
Returns
New BOOL column

◆ porter_stemmer_measure()

std::unique_ptr<cudf::column> nvtext::porter_stemmer_measure ( cudf::strings_column_view const &  input,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::mr::device_memory_resource *  mr = rmm::mr::get_current_device_resource() 
)

Returns the Porter Stemmer measurements of a strings column.

Porter stemming is used to normalize words by removing plural and tense endings from words in English. The stemming measurement involves counting consonant/vowel patterns within a string. Reference paper: https://tartarus.org/martin/PorterStemmer/def.txt

Each string in the input column is expected to contain a single, lower-cased word (or subword) with no punctuation and no whitespace; otherwise, the measure value for that string is undefined.

Also, the algorithm only works with English words.

Example:
st = ["tr", "troubles", "trouble"]
m = porter_stemmer_measure(st)
m is now [0,2,1]

A null input element at row i produces a corresponding null entry for row i in the output column.
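The measure counts how many vowel-run/consonant-run pairs (the m in Porter's [C](VC){m}[V] form) a word contains. The following host-side sketch illustrates that counting in plain C++; it is not libcudf's implementation. `porter_measure` and `porter_measure_column` are hypothetical names, nulls are modeled with `std::optional`, and input is assumed to be lower-case ASCII.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Counts the vowel-to-consonant transitions in a word, which equals the
// number of VC pairs in Porter's [C](VC){m}[V] decomposition.
int porter_measure(std::string const& word)
{
  int measure           = 0;
  bool prev_consonant   = false;  // classification of the previous letter
  bool prev_vowel       = false;
  for (std::size_t i = 0; i < word.size(); ++i) {
    char const ch = word[i];
    bool consonant = true;
    if (ch == 'a' || ch == 'e' || ch == 'i' || ch == 'o' || ch == 'u') {
      consonant = false;
    } else if (ch == 'y') {
      // 'y' is a vowel only when the previous letter is a consonant.
      consonant = (i == 0) || !prev_consonant;
    }
    if (consonant && prev_vowel) { ++measure; }  // a vowel run just ended
    prev_consonant = consonant;
    prev_vowel     = !consonant;
  }
  return measure;
}

// Hypothetical column-style wrapper: a null row produces a null result.
std::vector<std::optional<int>> porter_measure_column(
  std::vector<std::optional<std::string>> const& input)
{
  std::vector<std::optional<int>> output;
  output.reserve(input.size());
  for (auto const& row : input) {
    if (row.has_value()) {
      output.push_back(porter_measure(*row));
    } else {
      output.push_back(std::nullopt);
    }
  }
  return output;
}
```

For the example above, "tr" has no vowel run (measure 0), "troubles" splits as tr-ou-bl-e-s (measure 2), and "trouble" splits as tr-ou-bl-e (measure 1).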

Parameters
input  Strings column of words to measure
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource used to allocate the returned column's device memory
Returns
New INT32 column of measure values