libcudf  24.02.00
Normalizing

Files

file  normalize.hpp
 

Functions

std::unique_ptr< cudf::column > nvtext::normalize_spaces (cudf::strings_column_view const &input, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Returns a new strings column by normalizing the whitespace in each string in the input column.
 
std::unique_ptr< cudf::column > nvtext::normalize_characters (cudf::strings_column_view const &input, bool do_lower_case, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Normalizes the characters in each string for tokenizing.
 

Detailed Description

Function Documentation

◆ normalize_characters()

std::unique_ptr<cudf::column> nvtext::normalize_characters ( cudf::strings_column_view const &  input,
bool  do_lower_case,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::mr::device_memory_resource *  mr = rmm::mr::get_current_device_resource() 
)

Normalizes the characters in each string for tokenizing.

This uses the normalizer built into the nvtext::subword_tokenize function, which performs the following steps:

  • adding padding around punctuation (Unicode category starts with "P") as well as certain ASCII symbols like "^" and "$"
  • adding padding around the CJK Unicode block characters
  • changing whitespace (e.g. "\t", "\n", "\r") to a single space " "
  • removing control characters (Unicode categories "Cc" and "Cf")

The padding process adds a single space before and after the character. Details on Unicode categories can be found at https://unicodebook.readthedocs.io/unicode.html#categories
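
The rules above can be sketched on the host for ASCII input only. The function below is a hypothetical illustration, not the libcudf implementation: normalize_ascii_sketch is an invented name, it assumes every character for which std::ispunct is true (in the default "C" locale) is a padded symbol, and it passes non-ASCII bytes through untouched, so Unicode punctuation, CJK padding, and the "Cf" category are not covered.

```cpp
#include <cctype>
#include <string>

// Host-side, ASCII-only illustration of the documented rules.
// Hypothetical helper; this is NOT how libcudf implements the normalizer.
std::string normalize_ascii_sketch(std::string const& in)
{
  std::string out;
  for (unsigned char ch : in) {
    if (ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r') {
      out += ' ';  // whitespace becomes a plain space
    } else if (ch < 0x20 || ch == 0x7F) {
      continue;  // remaining ASCII control characters are removed
    } else if (std::ispunct(ch)) {
      // Assumption: all ASCII punctuation and symbols are padded.
      out += ' ';
      out += static_cast<char>(ch);
      out += ' ';
    } else {
      out += static_cast<char>(ch);  // letters, digits, non-ASCII bytes
    }
  }
  return out;
}
```

Under these assumptions, normalize_ascii_sketch("$24.08") yields " $ 24 . 08" and normalize_ascii_sketch("[a,bb]") yields " [ a , bb ] ", which agrees with the example output documented for this function.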

If do_lower_case = true, lower-casing also removes the accents. The accents cannot be removed from upper-case characters without lower-casing and lower-casing cannot be performed without also removing accents. However, if the accented character is already lower-case, then only the accent is removed.

s = ["éâîô\teaio", "ĂĆĖÑÜ", "ACENU", "$24.08", "[a,bb]"]
s1 = normalize_characters(s,true)
s1 is now ["eaio eaio", "acenu", "acenu", " $ 24 . 08", " [ a , bb ] "]
s2 = normalize_characters(s,false)
s2 is now ["éâîô eaio", "ĂĆĖÑÜ", "ACENU", " $ 24 . 08", " [ a , bb ] "]
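
The coupling between lower-casing and accent removal can be pictured as a single lookup that maps each code point to its lower-case, unaccented base letter. The sketch below is a hypothetical host-side illustration: fold_sketch and its hand-picked table are assumptions that cover only the characters in the example above, and whitespace and punctuation handling are left out.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical illustration; a real normalizer relies on full Unicode data.
// Each entry maps an accented code point to its lower-case base letter.
std::u32string fold_sketch(std::u32string const& in, bool do_lower_case)
{
  static std::unordered_map<char32_t, char32_t> const table{
    {U'\u00E9', U'e'}, {U'\u00E2', U'a'}, {U'\u00EE', U'i'},
    {U'\u00F4', U'o'}, {U'\u0102', U'a'}, {U'\u0106', U'c'},
    {U'\u0116', U'e'}, {U'\u00D1', U'n'}, {U'\u00DC', U'u'}};

  // With do_lower_case == false nothing is folded: accents and case stay.
  if (!do_lower_case) { return in; }

  std::u32string out;
  for (char32_t cp : in) {
    auto const it = table.find(cp);
    if (it != table.end()) {
      out += it->second;  // one step: lower-case and drop the accent
    } else if (cp >= U'A' && cp <= U'Z') {
      out += static_cast<char32_t>(cp + (U'a' - U'A'));  // plain ASCII lower-casing
    } else {
      out += cp;
    }
  }
  return out;
}
```

Because the table only stores the lower-case, unaccented target, there is no entry that would turn "Ă" into "A" or "ă", which mirrors the restriction described above.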

A null input element at row i produces a corresponding null entry for row i in the output column.

This function requires about 16x the number of character bytes in the input strings column as working memory.
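
As a rough planning aid, the "about 16x" figure can be turned into a simple estimate. The helpers below are hypothetical (estimate_working_bytes and fits_in_budget are not part of libcudf), and the factor is only as precise as the statement above.

```cpp
#include <cstddef>

// Factor taken from the "about 16x" statement; it is approximate.
constexpr std::size_t kWorkingMemoryFactor = 16;

// Hypothetical helper: rough working-memory estimate for a strings column
// whose character data occupies input_char_bytes bytes.
constexpr std::size_t estimate_working_bytes(std::size_t input_char_bytes)
{
  return input_char_bytes * kWorkingMemoryFactor;
}

// Hypothetical helper: true when the estimate fits within free_bytes.
constexpr bool fits_in_budget(std::size_t input_char_bytes, std::size_t free_bytes)
{
  return estimate_working_bytes(input_char_bytes) <= free_bytes;
}
```

For example, a column holding 256 MiB of character data would need roughly 4 GiB of working memory by this estimate.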

Parameters
input  The input strings to normalize
do_lower_case  If true, upper-case characters are converted to lower-case and accents are stripped from those characters. If false, accented and upper-case characters are not transformed.
stream  CUDA stream used for device memory operations and kernel launches
mr  Memory resource to allocate any returned objects
Returns
Normalized strings column

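◆ normalize_spaces()

std::unique_ptr<cudf::column> nvtext::normalize_spaces ( cudf::strings_column_view const &  input,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::mr::device_memory_resource *  mr = rmm::mr::get_current_device_resource() 
)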

Returns a new strings column by normalizing the whitespace in each string in the input column.

Normalizing a string replaces each run of one or more whitespace characters (any character with code point <= ' ') with a single space ' ' and trims whitespace from the beginning and end of the string.

Example:
s = ["a b", " c d\n", "e \t f "]
t = normalize_spaces(s)
t is now ["a b","c d","e f"]

A null input element at row i produces a corresponding null entry for row i in the output column.
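
The whitespace rule described above can be sketched for a single host string. This is a hypothetical reference for the documented semantics only: normalize_spaces_sketch is an invented name, it works on one std::string rather than a device column, and it does not model null rows.

```cpp
#include <string>

// Host-side illustration of the documented semantics; NOT the libcudf kernel.
// Any byte with value <= ' ' is treated as whitespace.
std::string normalize_spaces_sketch(std::string const& in)
{
  std::string out;
  bool pending_space = false;  // a whitespace run was seen after some content
  for (unsigned char ch : in) {
    if (ch <= ' ') {
      // Leading whitespace is skipped because out is still empty.
      pending_space = !out.empty();
    } else {
      if (pending_space) {
        out += ' ';  // the whole run collapses to one space
        pending_space = false;
      }
      out += static_cast<char>(ch);
    }
  }
  // A pending space at the end is never written, which trims trailing whitespace.
  return out;
}
```

With this sketch, "e \t f " becomes "e f" and " c d\n" becomes "c d", matching the example above.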

Parameters
input  Strings column to normalize
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource used to allocate the returned column's device memory
Returns
New strings column of normalized strings.