libcudf
24.04.00
|
This document serves as a guide for contributors to libcudf C++ code. Developers should also refer to these additional files for further documentation of libcudf best practices.
libcudf is a C++ library that provides GPU-accelerated data-parallel algorithms for processing column-oriented tabular data. libcudf provides algorithms including slicing, filtering, sorting, various types of aggregations, and database-type operations such as grouping and joins. libcudf serves a number of clients via multiple language interfaces, including Python and Java. Users may also use libcudf directly from C++ code.
This section defines terminology used within libcudf.
A column is an array of data of a single type. Along with Tables, columns are the fundamental data structures used in libcudf. Most libcudf algorithms operate on columns. Columns may have a validity mask representing whether each element is valid or null (invalid). Columns of nested types are supported, meaning that a column may have child columns. A column is the C++ equivalent to a cuDF Python Series.
An individual data item within a column. Also known as a row.
A type representing a single element of a data type.
A table is a collection of columns with equal number of elements. A table is the C++ equivalent to a cuDF Python DataFrame.
A view is a non-owning object that provides zero-copy access (possibly with slicing or offsets) to data owned by another object. Examples are column views and table views.
External/public libcudf APIs are grouped based on functionality into an appropriately titled header file in cudf/cpp/include/cudf/
. For example, cudf/cpp/include/cudf/copying.hpp
contains the APIs for functions related to copying from one column to another. Note the .hpp
file extension used to indicate a C++ header file.
Header files should use the #pragma once
include guard.
The naming of external API headers should be consistent with the name of the folder that contains the source files that implement the API. For example, the implementation of the APIs found in cudf/cpp/include/cudf/copying.hpp
are located in cudf/src/copying
. Likewise, the unit tests for the APIs reside in cudf/tests/copying/
.
Internal API headers containing detail
namespace definitions that are used across translation units inside libcudf should be placed in include/cudf/detail
.
.hpp
: C++ header files.cpp
: C++ source files.cu
: CUDA C++ source files.cuh
: Headers containing CUDA device codeOnly use .cu
and .cuh
if necessary. A good indicator is the inclusion of __device__
and other symbols that are only recognized by nvcc
. Another indicator is Thrust algorithm APIs with a device execution policy (always rmm::exec_policy
in libcudf).
libcudf code uses snake_case for all names except in a few cases: template parameters, unit tests and test case names may use Pascal case, aka UpperCamelCase. We do not use Hungarian notation, except sometimes when naming device data variables and their corresponding host copies. Private member variables are typically prefixed with an underscore.
C++ formatting is enforced using clang-format
. You should configure clang-format
on your machine to use the cudf/cpp/.clang-format
configuration file, and run clang-format
on all changed code before committing it. The easiest way to do this is to configure your editor to "format on save."
Aspects of code style not discussed in this document and not automatically enforceable are typically caught during code review, or not enforced.
In general, we recommend following C++ Core Guidelines. We also recommend watching Sean Parent's C++ Seasoning talk, and we try to follow his rules: "No raw loops. No raw pointers. No raw synchronization primitives."
Additional style guidelines for libcudf code:
const
after the type. This is not automatically enforced by clang-format
because the option QualifierAlignment: Right
has been observed to produce false negatives and false positives.1'234'567
. Hexadecimal values should use separators every 4 characters, like 0x0123'ABCD
.Documentation is discussed in the Documentation Guide.
The following guidelines apply to organizing #include
lines.
clang-format
will respect the groupings and sort the individual includes within a group lexicographically.<thrust/...>
, then includes from dependencies installed with cuDF, and then standard headers (for example <string>
, <iostream>
).cudf/cpp/.clang-format
file for specifics.<>
for all includes except for internal headers that are not in the include
directory. In other words, if it is a cuDF internal header (e.g. in the src
or test
directory), the path will not start with cudf
(e.g. #include <cudf/some_header.hpp>
) so it should use quotes. Example: #include "io/utilities/hostdevice_vector.hpp"
.cudf_test
and nvtext
are separate libraries within the libcudf
repo. As such, they have public headers in include
that should be included with <>
.clangd
often auto-insert includes when they can, but they usually get the grouping and brackets wrong. Correct the usage of quotes or brackets and then run clang-format to correct the grouping...
when possible. Paths with ..
are necessary when including (internal) headers from source paths not in the same directory as the including file, because source paths are not passed with -I
.src
directories in tests or in libcudf public headers. If you find yourself doing this, start a discussion about moving (parts of) the included internal header to a public header.Application data in libcudf is contained in Columns and Tables, but there are a variety of other data structures you will use when developing libcudf code.
Resource ownership is an essential concept in libcudf. In short, an "owning" object owns a resource (such as device memory). It acquires that resource during construction and releases the resource in destruction (RAII). A "non-owning" object does not own resources. Any class in libcudf with the *_view
suffix is non-owning. For more detail see the libcudf
presentation.
libcudf functions typically take views as input (column_view
or table_view
) and produce unique_ptr
s to owning objects as output. For example,
libcudf allocates all device memory via RMM memory resources (MR). See the RMM documentation for details.
RMM provides a "default" memory resource for each device that can be accessed and updated via the rmm::mr::get_current_device_resource()
and rmm::mr::set_current_device_resource(...)
functions, respectively. All memory resource parameters should be defaulted to use the return value of rmm::mr::get_current_device_resource()
.
cudf::column
is a core owning data structure in libcudf. Most libcudf public APIs produce either a cudf::column
or a cudf::table
as output. A column
contains device_buffer
s which own the device memory for the elements of a column and an optional null indicator bitmask.
Implicitly convertible to column_view
and mutable_column_view
.
Movable and copyable. A copy performs a deep copy of the column's contents, whereas a move moves the contents from one column to another.
Example:
A column
may have nested (child) columns, depending on the data type of the column. For example, LIST
, STRUCT
, and STRING
type columns.
cudf::column_view
is a core non-owning data structure in libcudf. It is an immutable, non-owning view of device memory as a column. Most libcudf public APIs take views as inputs.
A column_view
may be a view of a "slice" of a column. For example, it might view rows 75-150 of a column with 1000 rows. The size()
of this column_view
would be 75
, and accessing index 0
of the view would return the element at index 75
of the owning column
. Internally, this is implemented by storing in the view a pointer, an offset, and a size. column_view::data<T>()
returns a pointer iterator to column_view::head<T>() + offset
.
A mutable, non-owning view of device memory as a column. Used for detail APIs and (rare) public APIs that modify columns in place.
An immutable, non-owning view of device data as a column of elements that is trivially copyable and usable in CUDA device code. Used to pass column_view
data as input to CUDA kernels and device functions (including Thrust algorithms)
A mutable, non-owning view of device data as a column of elements that is trivially copyable and usable in CUDA device code. Used to pass column_view
data to be modified on the device by CUDA kernels and device functions (including Thrust algorithms).
Owning class for a set of cudf::column
s all with equal number of elements. This is the C++ equivalent to a data frame.
Implicitly convertible to cudf::table_view
and cudf::mutable_table_view
Movable and copyable. A copy performs a deep copy of all columns, whereas a move moves all columns from one table to another.
An immutable, non-owning view of a table.
A mutable, non-owning view of a table.
The cudf::size_type
is the type used for the number of elements in a column, offsets to elements within a column, indices to address specific elements, segments for subsets of column elements, etc. It is equivalent to a signed, 32-bit integer type and therefore has a maximum value of 2147483647. Some APIs also accept negative index values and those functions support a minimum value of -2147483648. This fundamental type also influences output values not just for column size limits but for counting elements as well.
libcudf provides span
classes that mimic C++20 std::span
, which is a lightweight view of a contiguous sequence of objects. libcudf provides two classes, host_span
and device_span
, which can be constructed from multiple container types, or from a pointer (host or device, respectively) and size, or from iterators. span
types are useful for defining generic (internal) interfaces which work with multiple input container types. device_span
can be constructed from thrust::device_vector
, rmm::device_vector
, or rmm::device_uvector
. host_span
can be constructed from thrust::host_vector
, std::vector
, or std::basic_string
.
If you are defining internal (detail) functions that operate on vectors, use spans for the input vector parameters rather than a specific vector type, to make your functions more widely applicable.
When a span
refers to immutable elements, use span<T const>
, not span<T> const
. Since a span is lightweight view, it does not propagate const
-ness. Therefore, const
should be applied to the template type parameter, not to the span
itself. Also, span
should be passed by value because it is a lightweight view. APIS in libcudf that take spans as input will look like the following function that copies device data to a host std::vector
.
A cudf::scalar
is an object that can represent a singular, nullable value of any of the types currently supported by cudf. Each type of value is represented by a separate type of scalar class which are all derived from cudf::scalar
. e.g. A numeric_scalar
holds a single numerical value, a string_scalar
holds a single string. The data for the stored value resides in device memory.
A list_scalar
holds the underlying data of a single list. This means the underlying data can be any type that cudf supports. For example, a list_scalar
representing a list of integers stores a cudf::column
of type INT32
. A list_scalar
representing a list of lists of integers stores a cudf::column
of type LIST
, which in turn stores a column of type INT32
.
Value type | Scalar class | Notes |
---|---|---|
fixed-width | fixed_width_scalar<T> | T can be any fixed-width type |
numeric | numeric_scalar<T> | T can be int8_t , int16_t , int32_t , int_64_t , float or double |
fixed-point | fixed_point_scalar<T> | T can be numeric::decimal32 or numeric::decimal64 |
timestamp | timestamp_scalar<T> | T can be timestamp_D , timestamp_s , etc. |
duration | duration_scalar<T> | T can be duration_D , duration_s , etc. |
string | string_scalar | This class object is immutable |
list | list_scalar | Underlying data can be any type supported by cudf |
scalar
s can be created using either their respective constructors or using factory functions like make_numeric_scalar()
, make_timestamp_scalar()
or make_string_scalar()
.
All the factory methods return a unique_ptr<scalar>
which needs to be statically downcasted to its respective scalar class type before accessing its value. Their validity (nullness) can be accessed without casting. Generally, the value needs to be accessed from a function that is aware of the value type e.g. a functor that is dispatched from type_dispatcher
. To cast to the requisite scalar class type given the value type, use the mapping utility scalar_type_t
provided in type_dispatcher.hpp
:
Each scalar type, except list_scalar
, has a corresponding non-owning device view class which allows access to the value and its validity from the device. This can be obtained using the function get_scalar_device_view(ScalarType s)
. Note that a device view is not provided for a base scalar object, only for the derived typed scalar class objects.
The underlying data for list_scalar
can be accessed via view()
method. For non-nested data, the device view can be obtained via function column_device_view::create(column_view)
. For nested data, a specialized device view for list columns can be constructed via lists_column_device_view(column_device_view)
.
libcudf
is designed to provide thread-safe, single-GPU accelerated algorithm primitives for solving a wide variety of problems that arise in data science. APIs are written to execute on the default GPU, which can be controlled by the caller through standard CUDA device APIs or environment variables like CUDA_VISIBLE_DEVICES
. Our goal is to enable diverse use cases like Spark or Pandas to benefit from the performance of GPUs, and libcudf relies on these higher-level layers like Spark or Dask to orchestrate multi-GPU tasks.
To best satisfy these use-cases, libcudf prioritizes performance and flexibility, which sometimes may come at the cost of convenience. While we welcome users to use libcudf directly, we design with the expectation that most users will be consuming libcudf through higher-level layers like Spark or cuDF Python that handle some of details that direct users of libcudf must handle on their own. We document these policies and the reasons behind them here.
libcudf APIs generally do not perform deep introspection and validation of input data. There are numerous reasons for this:
Users are therefore responsible for passing valid data into such APIs. Note that this policy does not mean that libcudf performs no validation whatsoever. libcudf APIs should still perform any validation that does not require introspection. To give some idea of what should or should not be validated, here are (non-exhaustive) lists of examples.
Things that libcudf should validate:
Things that libcudf should not validate:
Various libcudf APIs accepting columns of nested data types (such as LIST
or STRUCT
) may assume that these columns have been sanitized. In this context, sanitization refers to ensuring that the null elements in a column with a nested dtype are compatible with the elements of nested columns. Specifically:
libcudf APIs should promise to never return "dirty" columns, i.e. columns containing unsanitized data. Therefore, the only problem is if users construct input columns that are not correctly sanitized and then pass those into libcudf APIs.
libcudf APIs called on the host do not guarantee that the stream is synchronized before returning. Work in libcudf occurs on cudf::get_default_stream().value
, which defaults to the CUDA default stream (stream 0). Note that the stream 0 behavior differs if per-thread default stream is enabled via CUDF_USE_PER_THREAD_DEFAULT_STREAM
. Any data provided to or returned by libcudf that uses a separate non-blocking stream requires synchronization with the default libcudf stream to ensure stream safety.
Functions like merge or groupby in libcudf make no guarantees about the order of entries in the output. Promising deterministic ordering is not, in general, conducive to fast parallel algorithms. Calling code is responsible for performing sorts after the fact if sorted outputs are needed.
libcudf documents the exceptions that will be thrown by an API for different kinds of invalid inputs. The types of those exceptions (e.g. cudf::logic_error
) are part of the public API. However, the explanatory string returned by the what
method of those exceptions is not part of the API and is subject to change. Calling code should not rely on the contents of libcudf error messages to determine the nature of the error. For information on the types of exceptions that libcudf throws under different circumstances, see the section on error handling.
libcudf is in the process of adding support for asynchronous execution using CUDA streams. In order to facilitate the usage of streams, all new libcudf APIs that allocate device memory or execute a kernel should accept an rmm::cuda_stream_view
parameter at the end with a default value of cudf::get_default_stream()
. There is one exception to this rule: if the API also accepts a memory resource parameter, the stream parameter should be placed just before the memory resource. This API should then forward the call to a corresponding detail
API with an identical signature, except that the detail
API should not have a default parameter for the stream (detail APIs should always avoid default parameters). The implementation should be wholly contained in the detail
API definition and use only asynchronous versions of CUDA APIs with the stream parameter.
In order to make the detail
API callable from other libcudf functions, it should be exposed in a header placed in the cudf/cpp/include/detail/
directory.
For example:
Note: It is important to synchronize the stream if and only if it is necessary. For example, when a non-pointer value is returned from the API that is the result of an asynchronous device-to-host copy, the stream used for the copy should be synchronized before returning. However, when a column is returned, the stream should not be synchronized because doing so will break asynchrony.
Note: cudaDeviceSynchronize()
should never be used. This limits the ability to do any multi-stream/multi-threaded work with libcudf APIs.
There may be times in implementing libcudf features where it would be advantageous to use streams internally, i.e., to accomplish overlap in implementing an algorithm. However, dynamically creating a stream can be expensive. RMM has a stream pool class to help avoid dynamic stream creation. However, this is not yet exposed in libcudf, so for the time being, libcudf features should avoid creating streams (even if it is slightly less efficient). It is a good idea to leave a // TODO:
note indicating where using a stream would be beneficial.
Device memory resources are used in libcudf to abstract and control how device memory is allocated.
Any libcudf API that allocates memory that is returned to a user must accept a pointer to a device_memory_resource
as the last parameter. Inside the API, this memory resource must be used to allocate any memory for returned objects. It should therefore be passed into functions whose outputs will be returned. Example:
This rule automatically applies to all detail APIs that allocates memory. Any detail API may be called by any public API, and therefore could be allocating memory that is returned to the user. To support such uses cases, all detail APIs allocating memory resources should accept an mr
parameter. Callers are responsible for either passing through a provided mr
or rmm::mr::get_current_device_resource()
as needed.
Not all memory allocated within a libcudf API is returned to the caller. Often algorithms must allocate temporary, scratch memory for intermediate results. Always use the default resource obtained from rmm::mr::get_current_device_resource()
for temporary memory allocations. Example:
libcudf code generally eschews raw pointers and direct memory allocation. Use RMM classes built to use device_memory_resource
s for device memory allocation with automated lifetime management.
Allocates a specified number of bytes of untyped, uninitialized device memory using a device_memory_resource
. If no resource is explicitly provided, uses rmm::mr::get_current_device_resource()
.
rmm::device_buffer
is movable and copyable on a stream. A copy performs a deep copy of the device_buffer
's device memory on the specified stream, whereas a move moves ownership of the device memory from one device_buffer
to another.
Allocates a single element of the specified type initialized to the specified value. Use this for scalar input/outputs into device kernels, e.g., reduction results, null count, etc. This is effectively a convenience wrapper around a rmm::device_vector<T>
of length 1.
Allocates a specified number of elements of the specified type. If no initialization value is provided, all elements are default initialized (this incurs a kernel launch).
Note: We have removed all usage of rmm::device_vector
and thrust::device_vector
from libcudf, and you should not use it in new code in libcudf without careful consideration. Instead, use rmm::device_uvector
along with the utility factories in device_factories.hpp
. These utilities enable creation of uvector
s from host-side vectors, or creating zero-initialized uvector
s, so that they are as convenient to use as device_vector
. Avoiding device_vector
has a number of benefits, as described in the following section on rmm::device_uvector
.
Similar to a device_vector
, allocates a contiguous set of elements in device memory but with key differences:
T
to trivially copyable types.cuda_stream_view
specifying the stream on which the operation is performed). This improves safety when using non-default streams.device_uvector.hpp
does not include any __device__
code, unlike thrust/device_vector.hpp
, which means device_uvector
s can be used in .cpp
files, rather than just in .cu
files.While public libcudf APIs are free to include default function parameters, detail functions should not. Default memory resource parameters make it easy for developers to accidentally allocate memory using the incorrect resource. Avoiding default memory resources forces developers to consider each memory allocation carefully.
While streams are not currently exposed in libcudf's API, we plan to do so eventually. As a result, the same reasons for memory resources also apply to streams. Public APIs default to using cudf::get_default_stream()
. However, including the same default in detail APIs opens the door for developers to forget to pass in a user-provided stream if one is passed to a public API. Forcing every detail API call to explicitly pass a stream is intended to prevent such mistakes.
The memory resources (and eventually, the stream) are the final parameters for essentially all public APIs. For API consistency, the same is true throughout libcudf's internals. Therefore, a consequence of not allowing default streams or MRs is that no parameters in detail APIs may have defaults.
In order to aid in performance optimization and debugging, all compute intensive libcudf functions should have a corresponding NVTX range. Choose between CUDF_FUNC_RANGE
or cudf::scoped_range
for declaring NVTX ranges in the current scope:
CUDF_FUNC_RANGE()
macro if you want to use the name of the function as the name of the NVTX rangecudf::scoped_range rng{"custom_name"};
to provide a custom name for the current scope's NVTX rangeFor more information about NVTX, see here.
The preferred style for how inputs are passed in and outputs are returned is the following:
column_view const&
table_view const&
scalar const&
const&
mutable_column_view&
mutable_table_view&
std::unique_ptr<column>
std::unique_ptr<table>
std::unique_ptr<scalar>
Sometimes it is necessary for functions to have multiple outputs. There are a few ways this can be done in C++ (including creating a struct
for the output). One convenient way to do this is using std::tie
and std::pair
. Note that objects passed to std::pair
will invoke either the copy constructor or the move constructor of the object, and it may be preferable to move non-trivially copyable objects (and required for types with deleted copy constructors, like std::unique_ptr
).
Note: std::tuple
could be used if not for the fact that Cython does not support std::tuple
. Therefore, libcudf APIs must use std::pair
, and are therefore limited to return only two objects of different types. Multiple objects of the same type may be returned via a std::vector<T>
.
Alternatively, with C++17 (supported from cudf v0.20), structured binding may be used to disaggregate multiple return values:
Note that the compiler might not support capturing aliases defined in a structured binding in a lambda. One may work around this by using a capture with an initializer instead:
Increasingly, libcudf is moving toward internal (detail
) APIs with iterator parameters rather than explicit column
/table
/scalar
parameters. As with STL, iterators enable generic algorithms to be applied to arbitrary containers. A good example of this is cudf::copy_if_else
. This function takes two inputs, and a Boolean mask. It copies the corresponding element from the first or second input depending on whether the mask at that index is true
or false
. Implementing copy_if_else
for all combinations of column
and scalar
parameters is simplified by using iterators in the detail
API.
LeftIter
and RightIter
need only implement the necessary interface for an iterator. libcudf provides a number of iterator types and utilities that are useful with iterator-based APIs from libcudf as well as Thrust algorithms. Most are defined in include/detail/iterator.cuh
.
The pair iterator is used to access elements of nullable columns as a pair containing an element's value and validity. cudf::detail::make_pair_iterator
can be used to create a pair iterator from a column_device_view
or a cudf::scalar
. make_pair_iterator
is not available for mutable_column_device_view
.
This iterator replaces the null/validity value for each element with a specified constant (true
or false
). Created using cudf::detail::make_null_replacement_iterator
.
This iterator returns the validity of the underlying element (true
or false
). Created using cudf::detail::make_validity_iterator
.
The proliferation of data types supported by libcudf can result in long compile times. One area where compile time was a problem is in types used to store indices, which can be any integer type. The "Indexalator", or index-normalizing iterator (include/cudf/detail/indexalator.cuh
), can be used for index types (integers) without requiring a type-specific instance. It can be used for any iterator interface for reading an array of integer values of type int8
, int16
, int32
, int64
, uint8
, uint16
, uint32
, or uint64
. Reading specific elements always returns a cudf::size_type
integer.
Use the indexalator_factory
to create an appropriate input iterator from a column_view. Example input iterator usage:
Example output iterator usage:
All public libcudf APIs should be placed in the cudf
namespace. Example:
The top-level cudf
namespace is sufficient for most of the public API. However, to logically group a broad set of functions, further namespaces may be used. For example, there are numerous functions that are specific to columns of Strings. These functions reside in the cudf::strings::
namespace. Similarly, functionality used exclusively for unit testing is in the cudf::test::
namespace.
Many functions are not meant for public use, so place them in either the detail
or an anonymous namespace, depending on the situation.
Functions or objects that will be used across multiple translation units (i.e., source files), should be exposed in an internal header file and placed in the detail
namespace. Example:
Functions or objects that will only be used in a single translation unit should be defined in an anonymous namespace in the source file where it is used. Example:
Anonymous namespaces should never be used in a header file.
libcudf is constantly evolving to improve performance and better meet our users' needs. As a result, we occasionally need to break or entirely remove APIs to respond to new and improved understanding of the functionality we provide. Remaining free to do this is essential to making libcudf an agile library that can rapidly accommodate our users needs. As a result, we do not always provide a warning or any lead time prior to releasing breaking changes. On a best effort basis, the libcudf team will notify users of changes that we expect to have significant or widespread effects.
Where possible, indicate pending API removals using the deprecated attribute and document them using Doxygen's deprecated command prior to removal. When a replacement API is available for a deprecated API, mention the replacement in both the deprecation message and the deprecation documentation. Pull requests that introduce deprecations should be labeled "deprecation" to facilitate discovery and removal in the subsequent release.
Advertise breaking changes by labeling any pull request that breaks or removes an existing API with the "breaking" tag. This ensures that the "Breaking" section of the release notes includes a description of what has broken from the past release. Label pull requests that contain deprecations with the "non-breaking" tag.
libcudf follows conventions (and provides utilities) enforcing compile-time and run-time conditions and detecting and handling CUDA errors. Communication of errors is always via C++ exceptions.
Use the CUDF_EXPECTS
macro to enforce runtime conditions necessary for correct execution.
Example usage:
The first argument is the conditional expression expected to resolve to true
under normal conditions. If the conditional evaluates to false
, then an error has occurred and an instance of cudf::logic_error
is thrown. The second argument to CUDF_EXPECTS
is a short description of the error that has occurred and is used for the exception's what()
message.
There are times where a particular code path, if reached, should indicate an error no matter what. For example, often the default
case of a switch
statement represents an invalid alternative. Use the CUDF_FAIL
macro for such errors. This is effectively the same as calling CUDF_EXPECTS(false, reason)
.
Example:
Use the CUDF_CUDA_TRY
macro to check for the successful completion of CUDA runtime API functions. This macro throws a cudf::cuda_error
exception if the CUDA API return value is not cudaSuccess
. The thrown exception includes a description of the CUDA error code in its what()
message.
Example:
Use static_assert
to enforce compile-time conditions. For example,
libcudf includes logging utilities (built on top of spdlog library), which should be used to log important events (e.g. user warnings). This utility can also be used to log debug information, as long as the correct logging level is used. There are six macros that should be used for logging at different levels:
CUDF_LOG_TRACE
- verbose debug messages (targeted at developers)CUDF_LOG_DEBUG
- debug messages (targeted at developers)CUDF_LOG_INFO
- information about rare events (e.g. once per run) that occur during normal executionCUDF_LOG_WARN
- user warnings about potentially unexpected behavior or deprecationsCUDF_LOG_ERROR
- recoverable errorsCUDF_LOG_CRITICAL
- unrecoverable errors (e.g. memory corruption)By default, TRACE
, DEBUG
and INFO
messages are excluded from the log. In addition, in public builds, the code that logs at TRACE
and DEBUG
levels is compiled out. This prevents logging of potentially sensitive data that might be done for debug purposes. Also, this allows developers to include expensive computation in the trace/debug logs, as the overhead will not be present in the public builds. The minimum enabled logging level is WARN
, and it can be modified in multiple ways:
LIBCUDF_LOGGING_LEVEL
- sets the minimum level of logging that will be compiled in the build. Available levels are TRACE
, DEBUG
, INFO
, WARN
, ERROR
, CRITICAL
, and OFF
.LIBCUDF_LOGGING_LEVEL
- sets the minimum logging level during initialization. If this setting is higher than the compile-time CMake variable, any logging levels in between the two settings will be excluded from the written log. The available levels are the same as for the CMake variable.cudf::logger()
- sets the minimum logging level at runtime. For example, calling cudf::logger().set_level(spdlog::level::err)
, will exclude any messages that are not errors or critical errors. This API should not be used within libcudf to manipulate logging, its purpose is to allow upstream users to configure libcudf logging to fit their application.By default, logging messages are output to stderr. Setting the environment variable LIBCUDF_DEBUG_LOG_FILE
redirects the log to a file with the specified path (can be relative to the current directory). Upstream users can also manipulate cudf::logger().sinks()
to add sinks or divert the log to standard output or even a custom spdlog sink.
Columns may contain data of a number of types (see enum class type_id
in include/cudf/types.hpp
)
Most algorithms must support columns of any data type. This leads to complexity in the code, and is one of the primary challenges a libcudf developer faces. Sometimes we develop new algorithms with gradual support for more data types to make this easier. Typically we start with fixed-width data types such as numeric types and timestamps/durations, adding support for nested types later.
Enabling an algorithm differently for different types uses either template specialization or SFINAE, as discussed in Specializing Type-Dispatched Code Paths.
libcudf stores data (for columns and scalars) "type erased" in void*
device memory. This type-erasure enables interoperability with other languages and type systems, such as Python and Java. In order to determine the type, libcudf algorithms must use the run-time information stored in the column type()
to reconstruct the data type T
by casting the void*
to the appropriate T*
.
This so-called type dispatch is pervasive throughout libcudf. The type_dispatcher
is a central utility that automates the process of mapping the runtime type information in data_type
to a concrete C++ type.
At a high level, you call the type_dispatcher
with a data_type
and a function object (also known as a functor) with an operator()
template. Based on the value of data_type::id()
, the type dispatcher invokes the corresponding instantiation of the operator()
template.
This simplified example shows how the value of data_type::id()
determines which instantiation of the F::operator()
template is invoked.
The following example shows a function object called size_of_functor
that returns the size of the dispatched type.
By default, type_dispatcher
uses cudf::type_to_id<t>
to provide the mapping of cudf::type_id
to dispatched C++ types. However, this mapping may be customized by explicitly specifying a user-defined trait for the IdTypeMap
. For example, to always dispatch int32_t
for all values of cudf::type_id
:
Avoid multiple type-dispatch if possible. The compiler creates a code path for every type dispatched, so a second-level type dispatch results in quadratic growth in compilation time and object code size. As a large library with many types and functions, we are constantly working to reduce compilation time and code size.
It is often necessary to customize the dispatched operator()
for different types. This can be done in several ways.
The first method is to use explicit, full template specialization. This is useful for specializing behavior for single types. The following example function object prints "int32_t"
or "double"
when invoked with either of those types, or "unhandled type"
otherwise.
The second method is to use SFINAE with std::enable_if_t
. This is useful to partially specialize for a set of types with a common trait. The following example functor prints integral
or floating point
for integral or floating point types, respectively.
For more info on SFINAE with std::enable_if
, see this post.
There are a number of traits defined in include/cudf/utilities/traits.hpp
that are useful for partial specialization of dispatched function objects. For example is_numeric<T>()
can be used to specialize for any numeric type.
libcudf supports a number of variable-size and nested data types, including strings, lists, and structs.
string
: Simply a character string, but a column of strings may have a different-length string in each row.list
: A list of elements of any type, so a column of lists of integers has rows with a list of integers, possibly of a different length, in each row.struct
: In a column of structs, each row is a structure comprising one or more fields. These fields are stored in structure-of-arrays format, so that the column of structs has a nested column for each field of the structure.As the heading implies, list and struct columns may be nested arbitrarily. One may create a column of lists of structs, where the fields of the struct may be of any type, including strings, lists and structs. Thinking about deeply nested data types can be confusing for column-based data, even with experience. Therefore it is important to carefully write algorithms, and to test and document them well.
In order to represent variable-width elements, libcudf columns contain a vector of child columns. For list columns, the parent column's type is LIST
and contains no data, but its size represents the number of lists in the column, and its null mask represents the validity of each list element. The parent has two children.
size_type
elements that indicates the offset to the beginning of each list in a dense column of elements.With this representation, data[offsets[i]]
is the first element of list i
, and the size of list i
is given by offsets[i+1] - offsets[i]
.
Note that the data may be of any type, and therefore the data column may itself be a nested column of any type. Note also that not only is each list nullable (using the null mask of the parent), but each list element may be nullable. So you may have a lists column with null row 3, and also null element 2 of row 4.
The underlying data for a lists column is always bundled into a single leaf column at the very bottom of the hierarchy (ignoring structs, which conceptually "reset" the root of the hierarchy), regardless of the level of nesting. So a List<List<List<List<int>>>>
column has a single int
column at the very bottom. The following is a visual representation of this.
This is related to Arrow's "Variable-Size List" memory layout.
Strings are represented as a column with a data device buffer and a child offsets column. The parent column's type is STRING
and its data holds all the characters across all the strings packed together but its size represents the number of strings in the column, and its null mask represents the validity of each string. To summarize, the strings column children are:
size_type
elements that indicates the offset to the beginning of each string in a dense data buffer of all characters.With this representation, data[offsets[i]]
is the first character of string i
, and the size of string i
is given by offsets[i+1] - offsets[i]
. The following image shows an example of this compound column representation of strings.
A struct is a nested data type with a set of child columns each representing an individual field of a logical struct. Field names are not represented.
A structs column with N
fields has N
children. Each child is a column storing all the data of a single field packed column-wise, with an optional null mask. The parent column's type is STRUCT
and contains no data, its size represents the number of struct rows in the column, and its null mask represents the validity of each struct element.
With this representation, child[0][10]
is row 10 of the first field of the struct, child[1][42]
is row 42 of the second field of the struct.
Notice that in addition to the struct column's null mask, each struct field column has its own optional null mask. A struct field's validity can vary independently from the corresponding struct row. For instance, a non-null struct row might have a null field. However, the fields of a null struct row are deemed to be null as well. For example, consider a struct column of type STRUCT<FLOAT32, INT32>
. If the contents are [ {1.0, 2}, {4.0, 5}, null, {8.0, null} ]
, the struct column's layout is as follows. (Note that null masks should be read from right to left.)
The last struct row (index 3) is not null, but has a null value in the INT32 field. Also, row 2 of the struct column is null, making its corresponding fields also null. Therefore, bit 2 is unset in the null masks of both struct fields.
Dictionaries provide an efficient way to represent low-cardinality data by storing a single copy of each value. A dictionary comprises a column of sorted keys and a column containing an index into the keys column for each row of the parent column. The keys column may have any libcudf data type, such as a numerical type or strings. The indices represent the corresponding positions of each element's value in the keys. The indices child column can have any unsigned integer type (UINT8
, UINT16
, UINT32
, or UINT64
).
The first challenge with nested columns is that it is effectively impossible to do any operation that modifies the length of any string or list in place. For example, consider trying to append the character ‘'a’to the end of each string. This requires dynamically resizing the characters column to allow inserting
'a'` at the end of each string, and then modifying the offsets column to indicate the new size of each element. As a result, every operation that can modify the strings or lists in a column must be done out-of-place.
The second challenge is that in an out-of-place operation on a strings column, unlike with fixed- width elements, the size of the output cannot be known a priori. For example, consider scattering into a column of strings:
destination: {"this", "is", "a", "column", "of", "strings"} scatter_map: {1, 3, 5} scatter_values: {"red", "green", "blue"} result: {"this", "red", "a", "green", "of", "blue"}
In this example, the strings "red", "green", and "blue" will respectively be scattered into positions 1
, 3
, and 5
of destination
. Recall from above that this operation cannot be done in place, therefore result
will be generated by selectively copying strings from destination
and scatter_values
. Notice that result
's child column of characters requires storage for 19
characters. However, there is no way to know ahead of time that result
will require 19
characters. Therefore, most operations that produce a new output column of strings use a two-phase approach:
In scatter, the first phase consists of using the scatter_map
to determine whether string i
in the output will come from destination
or from scatter_values
and use the corresponding size(s) to materialize the offsets column and determine the size of the output. Then, in the second phase, sufficient storage is allocated for the output's characters, and then the characters are filled with the corresponding strings from either destination
or scatter_values
.
libcudf provides view types for nested column types as well as for the data elements within them.
cudf::strings_column_view
is a view of a strings column, like cudf::column_view
is a view of any cudf::column
. cudf::string_view
is a view of a single string, and therefore cudf::string_view
is the data type of a cudf::column
of type STRING
just like int32_t
is the data type for a cudf::column
of type size_type
. As its name implies, this is a read-only object instance that points to device memory inside the strings column. It's lifespan is the same (or less) as the column it views.
Use the column_device_view::element
method to access an individual row element. Like any other column, do not call element()
on a row that is null.
A null string is not the same as an empty string. Use the string_scalar
class if you need an instance of a class object to represent a null string.
The string_view
contains comparison operators <,>,==,<=,>=
that can be used in many cudf functions like sort
without string-specific code. The data for a string_view
instance is required to be UTF-8 and all operators and methods expect this encoding. Unless documented otherwise, position and length parameters are specified in characters and not bytes. The class also includes a string_view::const_iterator
which can be used to navigate through individual characters within the string.
cudf::type_dispatcher
dispatches to the string_view
data type when invoked on a STRING
column.
The libcudf strings column only supports UTF-8 encoding for strings data. UTF-8 is a variable-length character encoding wherein each character can be 1-4 bytes. This means the length of a string is not the same as its size in bytes. For this reason, it is recommended to use the string_view
class to access these characters for most operations.
The string_view.cuh
header also includes some utility methods for reading and writing (to_char_utf8/from_char_utf8
) individual UTF-8 characters to/from byte arrays.
cudf::lists_column_view
is a view of a lists column. cudf::list_view
is a view of a single list, and therefore cudf::list_view
is the data type of a cudf::column
of type LIST
.
cudf::type_dispatcher
dispatches to the list_view
data type when invoked on a LIST
column.
cudf::structs_column_view
is a view of a structs column. cudf::struct_view
is a view of a single struct, and therefore cudf::struct_view
is the data type of a cudf::column
of type STRUCT
.
cudf::type_dispatcher
dispatches to the struct_view
data type when invoked on a STRUCT
column.
cuIO is a component of libcudf that provides GPU-accelerated reading and writing of data file formats commonly used in data analytics, including CSV, Parquet, ORC, Avro, and JSON_Lines.
// TODO: add more detail and move to a separate file.