libcudf  23.12.00
MinHashing

Files

file  minhash.hpp
 

Functions

std::unique_ptr< cudf::column > nvtext::minhash (cudf::strings_column_view const &input, cudf::numeric_scalar< uint32_t > seed=0, cudf::size_type width=4, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Returns the minhash value for each string.
 
std::unique_ptr< cudf::column > nvtext::minhash (cudf::strings_column_view const &input, cudf::device_span< uint32_t const > seeds, cudf::size_type width=4, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Returns the minhash values for each string per seed.
 
std::unique_ptr< cudf::column > nvtext::minhash64 (cudf::strings_column_view const &input, cudf::numeric_scalar< uint64_t > seed=0, cudf::size_type width=4, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Returns the minhash value for each string.
 
std::unique_ptr< cudf::column > nvtext::minhash64 (cudf::strings_column_view const &input, cudf::device_span< uint64_t const > seeds, cudf::size_type width=4, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Returns the minhash values for each string per seed.
 


Function Documentation

◆ minhash() [1/2]

std::unique_ptr<cudf::column> nvtext::minhash ( cudf::strings_column_view const &  input,
cudf::device_span< uint32_t const >  seeds,
cudf::size_type  width = 4,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::mr::device_memory_resource *  mr = rmm::mr::get_current_device_resource() 
)

Returns the minhash values for each string per seed.

Hash values are computed from substrings of each string, and the minimum hash value is returned for each string for each seed. Each row of the output list column holds the seed results for the corresponding string. The order of the elements in each row matches the order of the seeds provided in the seeds parameter.

This function uses MurmurHash3_x86_32 for the hash algorithm.

Any null row entries result in corresponding null output rows.

Exceptions
std::invalid_argument  if width < 2
std::invalid_argument  if seeds is empty
std::overflow_error  if seeds.size() * input.size() exceeds the column size limit
Parameters
input  Strings column for which to compute minhash values
seeds  Seed values used for the hash algorithm
width  The character width of the substrings to hash; default is 4 characters.
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource used to allocate the returned column's device memory
Returns
List column of minhash values for each string per seed
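
The row and seed layout described above can be illustrated with a small host-side sketch. This is not the libcudf implementation: `toy_hash`, `min_hash_one`, and `minhash_per_seed` are hypothetical names, the hash is a seeded FNV-1a stand-in rather than MurmurHash3_x86_32, and the input is assumed to be ASCII so that bytes and characters coincide. It also assumes that a string shorter than `width` is hashed as a single substring, which this page does not state.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical stand-in hash (seeded FNV-1a). libcudf uses MurmurHash3_x86_32.
inline std::uint32_t toy_hash(std::string_view s, std::uint32_t seed)
{
  std::uint32_t h = 2166136261u ^ seed;
  for (unsigned char c : s) {
    h ^= c;
    h *= 16777619u;
  }
  return h;
}

// Minimum hash over all `width`-byte substrings of `s` for one seed.
// Assumption: a string no longer than `width` is hashed as one substring.
inline std::uint32_t min_hash_one(std::string_view s, std::uint32_t seed, std::size_t width)
{
  if (s.size() <= width) { return toy_hash(s, seed); }
  auto best = std::numeric_limits<std::uint32_t>::max();
  for (std::size_t pos = 0; pos + width <= s.size(); ++pos) {
    best = std::min(best, toy_hash(s.substr(pos, width), seed));
  }
  return best;
}

using row_t = std::optional<std::vector<std::uint32_t>>;

// One output row per input row; element j of a row belongs to seeds[j].
// A null input row (std::nullopt) produces a null output row.
inline std::vector<row_t> minhash_per_seed(
  std::vector<std::optional<std::string>> const& input,
  std::vector<std::uint32_t> const& seeds,
  std::size_t width = 4)
{
  std::vector<row_t> out;
  out.reserve(input.size());
  for (auto const& row : input) {
    if (!row) {
      out.emplace_back(std::nullopt);
      continue;
    }
    std::vector<std::uint32_t> values;
    values.reserve(seeds.size());
    for (auto seed : seeds) { values.push_back(min_hash_one(*row, seed, width)); }
    out.emplace_back(std::move(values));
  }
  return out;
}
```

Calling `minhash_per_seed` with three seeds yields rows of three values each, in seed order, with null rows passed through unchanged.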

◆ minhash() [2/2]

std::unique_ptr<cudf::column> nvtext::minhash ( cudf::strings_column_view const &  input,
cudf::numeric_scalar< uint32_t >  seed = 0,
cudf::size_type  width = 4,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::mr::device_memory_resource *  mr = rmm::mr::get_current_device_resource() 
)

Returns the minhash value for each string.

Hash values are computed from substrings of each string, and the minimum hash value is returned for each string.

Any null row entries result in corresponding null output rows.

This function uses MurmurHash3_x86_32 for the hash algorithm.

Exceptions
std::invalid_argument  if width < 2
Parameters
input  Strings column for which to compute minhash values
seed  Seed value used for the hash algorithm
width  The character width of the substrings to hash; default is 4 characters.
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource used to allocate the returned column's device memory
Returns
Minhash values for each string in input
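
For reference, the single-seed computation can be sketched on the host using the public MurmurHash3_x86_32 algorithm. The names `murmur3_x86_32` and `minhash_one` are hypothetical, the code assumes a little-endian host and ASCII input (so bytes equal characters), and it assumes that a string shorter than `width` is hashed whole. Bit-for-bit agreement with libcudf output has not been verified.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>
#include <string_view>

inline std::uint32_t rotl32(std::uint32_t x, int r) { return (x << r) | (x >> (32 - r)); }

// Public reference algorithm MurmurHash3_x86_32 (little-endian host assumed).
inline std::uint32_t murmur3_x86_32(std::string_view key, std::uint32_t seed)
{
  constexpr std::uint32_t c1 = 0xcc9e2d51u;
  constexpr std::uint32_t c2 = 0x1b873593u;
  auto const* data = reinterpret_cast<unsigned char const*>(key.data());
  std::size_t const nblocks = key.size() / 4;
  std::uint32_t h = seed;

  // Body: mix each full 4-byte block.
  for (std::size_t i = 0; i < nblocks; ++i) {
    std::uint32_t k;
    std::memcpy(&k, data + 4 * i, sizeof(k));
    k *= c1;
    k = rotl32(k, 15);
    k *= c2;
    h ^= k;
    h = rotl32(h, 13);
    h = h * 5u + 0xe6546b64u;
  }

  // Tail: mix the remaining 0 to 3 bytes.
  unsigned char const* tail = data + 4 * nblocks;
  std::uint32_t k = 0;
  switch (key.size() & 3u) {
    case 3: k ^= static_cast<std::uint32_t>(tail[2]) << 16; [[fallthrough]];
    case 2: k ^= static_cast<std::uint32_t>(tail[1]) << 8; [[fallthrough]];
    case 1:
      k ^= tail[0];
      k *= c1;
      k = rotl32(k, 15);
      k *= c2;
      h ^= k;
      break;
    default: break;
  }

  // Finalization: fold in the length, then avalanche the bits.
  h ^= static_cast<std::uint32_t>(key.size());
  h ^= h >> 16;
  h *= 0x85ebca6bu;
  h ^= h >> 13;
  h *= 0xc2b2ae35u;
  h ^= h >> 16;
  return h;
}

// Minimum hash over all `width`-byte substrings of `s`.
// Assumption: a string no longer than `width` is hashed as one substring.
inline std::uint32_t minhash_one(std::string_view s, std::uint32_t seed = 0, std::size_t width = 4)
{
  if (s.size() <= width) { return murmur3_x86_32(s, seed); }
  auto best = std::numeric_limits<std::uint32_t>::max();
  for (std::size_t pos = 0; pos + width <= s.size(); ++pos) {
    best = std::min(best, murmur3_x86_32(s.substr(pos, width), seed));
  }
  return best;
}
```

Two strings that share many substrings of length `width` are likely to share the same minimum, which is why equal minhash values are used as a signal of textual similarity.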

◆ minhash64() [1/2]

std::unique_ptr<cudf::column> nvtext::minhash64 ( cudf::strings_column_view const &  input,
cudf::device_span< uint64_t const >  seeds,
cudf::size_type  width = 4,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::mr::device_memory_resource *  mr = rmm::mr::get_current_device_resource() 
)

Returns the minhash values for each string per seed.

Hash values are computed from substrings of each string, and the minimum hash value is returned for each string for each seed. Each row of the output list column holds the seed results for the corresponding string. The order of the elements in each row matches the order of the seeds provided in the seeds parameter.

This function uses MurmurHash3_x64_128 for the hash algorithm.

Any null row entries result in corresponding null output rows.

Exceptions
std::invalid_argument  if width < 2
std::invalid_argument  if seeds is empty
std::overflow_error  if seeds.size() * input.size() exceeds the column size limit
Parameters
input  Strings column for which to compute minhash values
seeds  Seed values used for the hash algorithm
width  The character width of the substrings to hash; default is 4 characters.
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource used to allocate the returned column's device memory
Returns
List column of minhash values for each string per seed
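
The three exception conditions listed above can be sketched as a host-side pre-check. `checked_minhash_output_size` is a hypothetical helper, not part of libcudf, and the column size limit is assumed here to be the maximum value of `cudf::size_type` (a 32-bit signed integer). The exact comparison used inside libcudf may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>

using size_type = std::int32_t;  // mirrors cudf::size_type

// Hypothetical helper: validates the arguments and returns the number of
// hash values the per-seed overloads would produce (rows times seeds).
inline std::size_t checked_minhash_output_size(std::size_t num_rows,
                                               std::size_t num_seeds,
                                               size_type width)
{
  if (width < 2) { throw std::invalid_argument("width must be at least 2"); }
  if (num_seeds == 0) { throw std::invalid_argument("seeds must not be empty"); }

  // Assumed limit: the largest row count a single column can hold.
  auto const limit = static_cast<std::size_t>(std::numeric_limits<size_type>::max());

  // Division avoids overflowing the product itself.
  if (num_rows > limit / num_seeds) {
    throw std::overflow_error("seeds.size() * input.size() exceeds the column size limit");
  }
  return num_rows * num_seeds;
}
```

Under this assumption, one billion input rows with three seeds would be rejected, because three billion values exceed the 2,147,483,647 limit; splitting the input or the seed list into batches avoids the error.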

◆ minhash64() [2/2]

std::unique_ptr<cudf::column> nvtext::minhash64 ( cudf::strings_column_view const &  input,
cudf::numeric_scalar< uint64_t >  seed = 0,
cudf::size_type  width = 4,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::mr::device_memory_resource *  mr = rmm::mr::get_current_device_resource() 
)

Returns the minhash value for each string.

Hash values are computed from substrings of each string, and the minimum hash value is returned for each string.

Any null row entries result in corresponding null output rows.

This function uses MurmurHash3_x64_128 for the hash algorithm. The hash function returns two uint64 values, but only the first value is used in the minhash calculation.

Exceptions
std::invalid_argument  if width < 2
Parameters
input  Strings column for which to compute minhash values
seed  Seed value used for the hash algorithm
width  The character width of the substrings to hash; default is 4 characters.
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource used to allocate the returned column's device memory
Returns
Minhash values as UINT64 for each string in input
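
The note about using only the first of the two 64-bit hash words can be illustrated with a host-side sketch. The hash here is a hypothetical stand-in (`toy_hash128`, built from the splitmix64 finalizer) and is explicitly not MurmurHash3_x64_128; only the shape of the computation is meant to match. ASCII input is assumed, as is hashing a string shorter than `width` as a single substring.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string_view>

// splitmix64 finalizer, used only as a mixing step for the stand-in hash.
inline std::uint64_t mix64(std::uint64_t x)
{
  x ^= x >> 30;
  x *= 0xbf58476d1ce4e5b9ULL;
  x ^= x >> 27;
  x *= 0x94d049bb133111ebULL;
  x ^= x >> 31;
  return x;
}

// Hypothetical stand-in for a 128-bit hash: returns two 64-bit words.
// This is NOT MurmurHash3_x64_128.
inline std::array<std::uint64_t, 2> toy_hash128(std::string_view s, std::uint64_t seed)
{
  std::uint64_t h1 = seed;
  std::uint64_t h2 = ~seed;
  for (unsigned char c : s) {
    h1 = mix64(h1 ^ c);
    h2 = mix64(h2 + h1);
  }
  return {mix64(h1), mix64(h2)};
}

// Minimum over substrings of the FIRST 64-bit word only; the second is discarded.
// Assumption: a string no longer than `width` is hashed as one substring.
inline std::uint64_t minhash64_one(std::string_view s, std::uint64_t seed = 0, std::size_t width = 4)
{
  if (s.size() <= width) { return toy_hash128(s, seed)[0]; }
  auto best = std::numeric_limits<std::uint64_t>::max();
  for (std::size_t pos = 0; pos + width <= s.size(); ++pos) {
    best = std::min(best, toy_hash128(s.substr(pos, width), seed)[0]);
  }
  return best;
}
```

Discarding the second word keeps each result in a single UINT64 element, which matches the return type documented above.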