libcudf  24.02.00
Jaccard Index

Files

file  jaccard.hpp
 

Functions

std::unique_ptr< cudf::column > nvtext::jaccard_index (cudf::strings_column_view const &input1, cudf::strings_column_view const &input2, cudf::size_type width, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Computes the Jaccard similarity between individual rows in two strings columns.
 

Detailed Description

Function Documentation

◆ jaccard_index()

Computes the Jaccard similarity between individual rows in two strings columns.

The similarity is calculated between strings in corresponding rows such that output[row] = J(input1[row],input2[row]).

The Jaccard index formula is https://en.wikipedia.org/wiki/Jaccard_index

J = |A ∩ B| / |A ∪ B|
where |A ∩ B| is the number of common values between A and B
and |x| is the number of unique values in x.

The computation here compares strings columns by treating each string as text (e.g. sentences, paragraphs, articles) rather than as individual words or tokens to be compared directly. The algorithm slides a window (its size given by the width parameter) across each string to form the set of tokens to compare within each row of the two input columns.

These substrings are essentially character ngrams and are used in the union and intersection calculations for that row. For efficiency, the substrings are hashed using the default MurmurHash32 to identify the unique ones within each row. Once the union and intersection sizes for the row are resolved, the Jaccard index is computed using the above formula and returned as a float32 value.
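
The per-row computation described above can be sketched on the host as follows. This is a minimal illustration, not libcudf's device implementation: `window_tokens` and `jaccard_row` are hypothetical names, `std::unordered_set` stands in for the hash-based uniqueness step, the window moves over bytes rather than UTF-8 characters, and returning 0 when neither string yields a token is a choice made for the sketch only.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string_view>
#include <unordered_set>

// Collect the distinct substrings of length `width` produced by sliding a
// window across `text` one byte at a time (hypothetical helper).
std::unordered_set<std::string_view> window_tokens(std::string_view text, std::size_t width)
{
  std::unordered_set<std::string_view> tokens;
  for (std::size_t pos = 0; pos + width <= text.size(); ++pos) {
    tokens.insert(text.substr(pos, width));
  }
  return tokens;
}

// Host-side reference for a single row:
// J = |A intersect B| / |A union B| over the window tokens of both strings.
float jaccard_row(std::string_view a, std::string_view b, std::size_t width)
{
  if (width < 2) { throw std::invalid_argument("width must be at least 2"); }

  auto const tokens_a = window_tokens(a, width);
  auto const tokens_b = window_tokens(b, width);

  // Count the tokens that occur in both sets.
  std::size_t common = 0;
  for (auto const& token : tokens_a) {
    common += tokens_b.count(token);
  }

  // |A union B| = |A| + |B| - |A intersect B|
  std::size_t const total = tokens_a.size() + tokens_b.size() - common;
  assert(common <= total);

  // Sketch-only choice: report 0 when neither string is long enough to yield a token.
  return total == 0 ? 0.0f : static_cast<float>(common) / static_cast<float>(total);
}
```

Storing `std::string_view` tokens keeps the sketch exact and avoids copying substrings; the views are only valid while the original strings are alive.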

input1 = ["the fuzzy dog", "little piggy", "funny bunny", "chatty parrot"]
input2 = ["the fuzzy cat", "bitty piggy", "funny bunny", "silent partner"]
r = jaccard_index(input1, input2, 5)
r is now [0.5, 0.15384616, 1.0, 0]
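
As a check on the first two rows, the arithmetic can be worked by hand. It assumes width = 5, the value that reproduces the results listed above. With that width, "the fuzzy dog" and "the fuzzy cat" each yield 9 distinct windows and share 6 of them ("the f", "he fu", "e fuz", " fuzz", "fuzzy", "uzzy "). "little piggy" and "bitty piggy" yield 8 and 7 distinct windows and share 2 (" pigg", "piggy").

```latex
\[
J_0 = \frac{6}{9 + 9 - 6} = \frac{6}{12} = 0.5,
\qquad
J_1 = \frac{2}{8 + 7 - 2} = \frac{2}{13} \approx 0.1538
\]
```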

If either input column's row is null, the output for that row will also be null.

Exceptions
std::invalid_argument  if width < 2 or input1.size() != input2.size()
Parameters
input1  Strings column to compare with input2
input2  Strings column to compare with input1
width  Character width of the sliding window applied to each string to form substrings
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource used to allocate the returned column's device memory
Returns
Column of float32 Jaccard index values, one per input row
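
The null handling and argument checks documented above can be mirrored in a small host-side sketch. It is an illustration under stated assumptions rather than the library's code: `maybe_string` and `jaccard_column` are hypothetical names, `std::optional` models nullable rows, windows move over bytes rather than UTF-8 characters, and rows too short to yield any token are given 0 as a sketch-only choice.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <string_view>
#include <unordered_set>
#include <vector>

using maybe_string = std::optional<std::string>;  // std::nullopt models a null row

// Hypothetical host-side analogue of the column-level call.
std::vector<std::optional<float>> jaccard_column(std::vector<maybe_string> const& input1,
                                                 std::vector<maybe_string> const& input2,
                                                 std::size_t width)
{
  // Same argument checks as listed under Exceptions.
  if (width < 2) { throw std::invalid_argument("width must be at least 2"); }
  if (input1.size() != input2.size()) { throw std::invalid_argument("input sizes must match"); }

  // Distinct substrings of length `width` taken from a sliding window.
  auto tokens = [width](std::string_view text) {
    std::unordered_set<std::string_view> out;
    for (std::size_t pos = 0; pos + width <= text.size(); ++pos) {
      out.insert(text.substr(pos, width));
    }
    return out;
  };

  std::vector<std::optional<float>> output(input1.size());  // every row starts out null
  for (std::size_t i = 0; i < input1.size(); ++i) {
    if (!input1[i] || !input2[i]) { continue; }  // a null on either side leaves the output null

    auto const a = tokens(*input1[i]);
    auto const b = tokens(*input2[i]);

    std::size_t common = 0;
    for (auto const& t : a) { common += b.count(t); }
    std::size_t const total = a.size() + b.size() - common;

    // Sketch-only choice: 0 when neither string produced a token.
    output[i] = total == 0 ? 0.0f : static_cast<float>(common) / static_cast<float>(total);
  }
  return output;
}
```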