libcudf  23.12.00
cudf::io Namespace Reference

IO interfaces. More...

Classes

class  arrow_io_source
 Implementation class for reading from an Apache Arrow file. The file could be a memory-mapped file or other implementation supported by Arrow. More...
 
class  avro_reader_options
 Settings to use for read_avro(). More...
 
class  avro_reader_options_builder
 Builder to build options for read_avro(). More...
 
class  csv_reader_options
 Settings to use for read_csv(). More...
 
class  csv_reader_options_builder
 Builder to build options for read_csv(). More...
 
class  csv_writer_options
 Settings to use for write_csv(). More...
 
class  csv_writer_options_builder
 Builder to build options for write_csv(). More...
 
class  data_sink
 Interface class for storing the output data from the writers. More...
 
class  datasource
 Interface class for providing input data to the readers. More...
 
struct  schema_element
 Allows specifying the target types for nested JSON data via json_reader_options' set_dtypes method. More...
 
class  json_reader_options
 Input arguments to the read_json interface. More...
 
class  json_reader_options_builder
 Builds settings to use for read_json(). More...
 
class  json_writer_options
 Settings to use for write_json(). More...
 
class  json_writer_options_builder
 Builder to build options for write_json(). More...
 
class  orc_reader_options
 Settings to use for read_orc(). More...
 
class  orc_reader_options_builder
 Builds settings to use for read_orc(). More...
 
class  orc_writer_options
 Settings to use for write_orc(). More...
 
class  orc_writer_options_builder
 Builds settings to use for write_orc(). More...
 
class  chunked_orc_writer_options
 Settings to use for write_orc_chunked(). More...
 
class  chunked_orc_writer_options_builder
 Builds settings to use for write_orc_chunked(). More...
 
class  orc_chunked_writer
 Chunked ORC writer class that writes an ORC file in a chunked/streamed form. More...
 
struct  raw_orc_statistics
 Holds column names and buffers containing raw file-level and stripe-level statistics. More...
 
struct  minmax_statistics
 Base class for column statistics that include optional minimum and maximum. More...
 
struct  sum_statistics
 Base class for column statistics that include an optional sum. More...
 
struct  integer_statistics
 Statistics for integral columns. More...
 
struct  double_statistics
 Statistics for floating point columns. More...
 
struct  string_statistics
 Statistics for string columns. More...
 
struct  bucket_statistics
 Statistics for boolean columns. More...
 
struct  decimal_statistics
 Statistics for decimal columns. More...
 
struct  timestamp_statistics
 Statistics for timestamp columns. More...
 
struct  column_statistics
 Contains per-column ORC statistics. More...
 
struct  parsed_orc_statistics
 Holds column names and parsed file-level and stripe-level statistics. More...
 
struct  orc_column_schema
 Schema of an ORC column, including the nested columns. More...
 
struct  orc_schema
 Schema of an ORC file. More...
 
class  orc_metadata
 Information about content of an ORC file. More...
 
class  parquet_reader_options
 Settings for read_parquet(). More...
 
class  parquet_reader_options_builder
 Builds parquet_reader_options to use for read_parquet(). More...
 
class  chunked_parquet_reader
 Chunked Parquet reader class that reads a Parquet file iteratively into a series of tables, chunk by chunk. More...
 
class  parquet_writer_options
 Settings for write_parquet(). More...
 
class  parquet_writer_options_builder
 Class to build parquet_writer_options. More...
 
class  chunked_parquet_writer_options
 Settings for write_parquet_chunked(). More...
 
class  chunked_parquet_writer_options_builder
 Builds options for chunked_parquet_writer_options. More...
 
class  parquet_chunked_writer
 Chunked Parquet writer class that handles options and writes tables in chunks. More...
 
struct  parquet_column_schema
 Schema of a parquet column, including the nested columns. More...
 
struct  parquet_schema
 Schema of a parquet file. More...
 
class  parquet_metadata
 Information about content of a parquet file. More...
 
class  writer_compression_statistics
 Statistics about compression performed by a writer. More...
 
struct  column_name_info
 Detailed name (and optionally nullability) information for output columns. More...
 
struct  table_metadata
 Table metadata returned by IO readers. More...
 
struct  table_with_metadata
 Table with table metadata used by io readers to return the metadata by value. More...
 
struct  host_buffer
 Non-owning view of a host memory buffer. More...
 
struct  source_info
 Source information for read interfaces. More...
 
struct  sink_info
 Destination information for write interfaces. More...
 
class  column_in_metadata
 Metadata for a column. More...
 
class  table_input_metadata
 Metadata for a table. More...
 
struct  partition_info
 Information used while writing partitioned datasets. More...
 
class  reader_column_schema
 Schema element for the reader. More...
 

Typedefs

using no_statistics = std::monostate
 Monostate type alias for the statistics variant.
 
using date_statistics = minmax_statistics< int32_t >
 Statistics for date(time) columns.
 
using binary_statistics = sum_statistics< int64_t >
 Statistics for binary columns. More...
 

Enumerations

enum class  json_recovery_mode_t { FAIL , RECOVER_WITH_NULL }
 Control the error recovery behavior of the json parser. More...
 
enum class  compression_type {
  NONE , AUTO , SNAPPY , GZIP ,
  BZIP2 , BROTLI , ZIP , XZ ,
  ZLIB , LZ4 , LZO , ZSTD
}
 Compression algorithms. More...
 
enum class  io_type {
  FILEPATH , HOST_BUFFER , DEVICE_BUFFER , VOID ,
  USER_IMPLEMENTED
}
 Data source or destination types. More...
 
enum class  quote_style { MINIMAL , ALL , NONNUMERIC , NONE }
 Behavior when handling quotations in field data. More...
 
enum  statistics_freq { STATISTICS_NONE = 0 , STATISTICS_ROWGROUP = 1 , STATISTICS_PAGE = 2 , STATISTICS_COLUMN = 3 }
 Column statistics granularity type for parquet/orc writers. More...
 
enum  dictionary_policy { NEVER = 0 , ADAPTIVE = 1 , ALWAYS = 2 }
 Control use of dictionary encoding for parquet writer. More...
 

Functions

table_with_metadata read_avro (avro_reader_options const &options, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Reads an Avro dataset into a set of columns. More...
 
table_with_metadata read_csv (csv_reader_options options, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Reads a CSV dataset into a set of columns. More...
 
void write_csv (csv_writer_options const &options, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Writes a set of columns to CSV format. More...
 
table_with_metadata read_json (json_reader_options options, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Reads a JSON dataset into a set of columns. More...
 
void write_json (json_writer_options const &options, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Writes a set of columns to JSON format. More...
 
table_with_metadata read_orc (orc_reader_options const &options, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Reads an ORC dataset into a set of columns. More...
 
void write_orc (orc_writer_options const &options)
 Writes a set of columns to ORC format. More...
 
raw_orc_statistics read_raw_orc_statistics (source_info const &src_info)
 Reads raw file-level and stripe-level statistics of an ORC dataset. More...
 
parsed_orc_statistics read_parsed_orc_statistics (source_info const &src_info)
 Reads parsed file-level and stripe-level statistics of an ORC dataset. More...
 
orc_metadata read_orc_metadata (source_info const &src_info)
 Reads metadata of ORC dataset. More...
 
table_with_metadata read_parquet (parquet_reader_options const &options, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Reads a Parquet dataset into a set of columns. More...
 
std::unique_ptr< std::vector< uint8_t > > write_parquet (parquet_writer_options const &options)
 Writes a set of columns to parquet format. More...
 
std::unique_ptr< std::vector< uint8_t > > merge_row_group_metadata (std::vector< std::unique_ptr< std::vector< uint8_t >>> const &metadata_list)
 Merges multiple raw metadata blobs that were previously created by write_parquet into a single metadata blob. More...
 
parquet_metadata read_parquet_metadata (source_info const &src_info)
 Reads metadata of parquet dataset. More...
 
template<typename T >
constexpr auto is_byte_like_type ()
 Returns true if the type is byte-like, meaning it is reasonable to pass as a pointer to bytes. More...
 

Variables

constexpr size_t default_stripe_size_bytes = 64 * 1024 * 1024
 64MB default orc stripe size
 
constexpr size_type default_stripe_size_rows = 1000000
 1M rows default orc stripe rows
 
constexpr size_type default_row_index_stride = 10000
 10K rows default orc row index stride
 
constexpr size_t default_row_group_size_bytes = 128 * 1024 * 1024
 128MB per row group
 
constexpr size_type default_row_group_size_rows = 1000000
 1 million rows per row group
 
constexpr size_t default_max_page_size_bytes = 512 * 1024
 512KB per page
 
constexpr size_type default_max_page_size_rows = 20000
 20k rows per page
 
constexpr int32_t default_column_index_truncate_length = 64
 truncate to 64 bytes
 
constexpr size_t default_max_dictionary_size = 1024 * 1024
 1MB dictionary size
 
constexpr size_type default_max_page_fragment_size = 5000
 5000 rows per page fragment
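
The size defaults above are plain compile-time arithmetic. As a rough sketch, the block below re-declares a few of them in a local `sketch` namespace (a hypothetical mirror, not libcudf's headers; `size_type` is assumed to be a 32-bit signed integer) and checks how the byte and row limits relate to each other.

```cpp
#include <cstddef>
#include <cstdint>

// Local mirrors of some documented defaults, for illustration only.
namespace sketch {
using size_type = std::int32_t;  // assumption: 32-bit signed row count type

constexpr std::size_t default_stripe_size_bytes    = 64 * 1024 * 1024;
constexpr std::size_t default_row_group_size_bytes = 128 * 1024 * 1024;
constexpr std::size_t default_max_page_size_bytes  = 512 * 1024;
constexpr size_type default_row_group_size_rows    = 1000000;
constexpr size_type default_max_page_size_rows     = 20000;

// How many maximum-size pages fit in a maximum-size row group, by bytes.
constexpr std::size_t full_pages_per_row_group_by_bytes =
  default_row_group_size_bytes / default_max_page_size_bytes;

// The same ratio computed from the row limits instead.
constexpr size_type full_pages_per_row_group_by_rows =
  default_row_group_size_rows / default_max_page_size_rows;

static_assert(default_stripe_size_bytes == 67108864, "64 * 1024 * 1024 bytes");
static_assert(full_pages_per_row_group_by_bytes == 256, "(128 * 1024 * 1024) / (512 * 1024)");
static_assert(full_pages_per_row_group_by_rows == 50, "1000000 / 20000");
}  // namespace sketch
```

The two ratios differ (256 versus 50), so which limit closes a page or row group first depends on the average row width of the data being written.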
 

Detailed Description

IO interfaces.

Typedef Documentation

◆ binary_statistics

using cudf::io::binary_statistics = sum_statistics<int64_t>

Statistics for binary columns.

The sum is the total number of bytes across all elements.

Definition at line 135 of file orc_metadata.hpp.
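
To illustrate what "the sum is the total number of bytes across all elements" means, here is a minimal sketch using local stand-in types. The `sketch` namespace, the `statistics_variant` alias, and the `make_binary_statistics` helper are hypothetical and exist only for this example; only the alias shapes (`std::monostate`, `sum_statistics<int64_t>`) come from the listing above.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>
#include <variant>

namespace sketch {
// Hypothetical stand-in for a statistics type that carries an optional sum.
template <typename T>
struct sum_statistics {
  std::optional<T> sum;
};

using no_statistics     = std::monostate;
using binary_statistics = sum_statistics<std::int64_t>;

// Hypothetical variant: either no statistics at all, or binary statistics.
using statistics_variant = std::variant<no_statistics, binary_statistics>;

// Adds up the byte length of every element of a binary column.
template <std::size_t N>
constexpr binary_statistics make_binary_statistics(
  std::array<std::string_view, N> const& elements)
{
  std::int64_t total = 0;
  for (auto const& e : elements) { total += static_cast<std::int64_t>(e.size()); }
  return binary_statistics{total};
}
}  // namespace sketch
```

For the three elements `"ab"`, `""`, and `"cde"`, the helper yields a sum of 5 bytes. A default-constructed `statistics_variant` holds the `no_statistics` alternative.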

Enumeration Type Documentation

◆ compression_type

Compression algorithms.

Enumerator
NONE 

No compression.

AUTO 

Automatically detect or select compression format.

SNAPPY 

Snappy format, using byte-oriented LZ77.

GZIP 

GZIP format, using DEFLATE algorithm.

BZIP2 

BZIP2 format, using Burrows-Wheeler transform.

BROTLI 

BROTLI format, using LZ77 + Huffman + 2nd order context modeling.

ZIP 

ZIP format, using DEFLATE algorithm.

XZ 

XZ format, using LZMA(2) algorithm.

ZLIB 

ZLIB format, using DEFLATE algorithm.

LZ4 

LZ4 format, using LZ77.

LZO 

Lempel–Ziv–Oberhumer format.

ZSTD 

Zstandard format.

Definition at line 50 of file io/types.hpp.

◆ dictionary_policy

Control use of dictionary encoding for parquet writer.

Enumerator
NEVER 

Never use dictionary encoding.

ADAPTIVE 

Use dictionary when it will not impact compression.

ALWAYS 

Use dictionary regardless of impact on compression.

Definition at line 197 of file io/types.hpp.

◆ io_type

enum class cudf::io::io_type

Data source or destination types.

Enumerator
FILEPATH 

Input/output is a file path.

HOST_BUFFER 

Input/output is a buffer in host memory.

DEVICE_BUFFER 

Input/output is a buffer in device memory.

VOID 

Input/output is nothing. No work is done. Useful for benchmarking.

USER_IMPLEMENTED 

Input/output is handled by a custom user class.

Definition at line 68 of file io/types.hpp.

◆ quote_style

enum class cudf::io::quote_style

Behavior when handling quotations in field data.

Enumerator
MINIMAL 

Quote only fields which contain special characters.

ALL 

Quote all fields.

NONNUMERIC 

Quote all non-numeric fields.

NONE 

Never quote fields; disable quotation parsing.

Definition at line 79 of file io/types.hpp.
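
The four policies can be read as a per-field decision. The sketch below mirrors the enumerators in a local `sketch` namespace and encodes that decision. The `should_quote` and `has_special_chars` helpers are hypothetical, and the set of "special characters" (comma, double quote, line breaks) is an assumption, since the listing above does not define it.

```cpp
#include <string_view>

namespace sketch {
// Local mirror of the documented enumerators.
enum class quote_style { MINIMAL, ALL, NONNUMERIC, NONE };

// Assumption: delimiter ',', quote character '"', and line breaks are "special".
constexpr bool has_special_chars(std::string_view field)
{
  for (char c : field) {
    if (c == ',' || c == '"' || c == '\n' || c == '\r') { return true; }
  }
  return false;
}

// Decides whether a single field would be wrapped in quotes under each policy.
constexpr bool should_quote(quote_style style, std::string_view field, bool is_numeric)
{
  switch (style) {
    case quote_style::ALL: return true;
    case quote_style::NONNUMERIC: return !is_numeric;
    case quote_style::NONE: return false;
    case quote_style::MINIMAL: return has_special_chars(field);
  }
  return false;  // unreachable for valid enumerators
}
}  // namespace sketch
```

Under this sketch, the field `a,b` is quoted with `MINIMAL` and `ALL`, but left bare with `NONE`. The numeric field `1` is quoted only with `ALL`.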

◆ statistics_freq

Column statistics granularity type for parquet/orc writers.

Enumerator
STATISTICS_NONE 

No column statistics.

STATISTICS_ROWGROUP 

Per-rowgroup column statistics.

STATISTICS_PAGE 

Per-page column statistics.

STATISTICS_COLUMN 

Full column and offset indices. Implies STATISTICS_ROWGROUP.

Definition at line 89 of file io/types.hpp.

Function Documentation

◆ is_byte_like_type()

template<typename T >
constexpr auto cudf::io::is_byte_like_type ( )
inline constexpr

Returns true if the type is byte-like, meaning it is reasonable to pass as a pointer to bytes.

Template Parameters
T	The representation type
Returns
true if the type is considered a byte-like type

Definition at line 277 of file io/types.hpp.
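
A trait of this kind can be written as a `constexpr` function over `std::is_same_v`. The sketch below shows one plausible shape. It lives in a local `sketch` namespace, and the exact set of accepted types (`char`, `unsigned char`, `int8_t`, `uint8_t`, `std::byte`, ignoring cv-qualifiers) is an assumption rather than a statement of what libcudf accepts.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

namespace sketch {
// Returns true when T, with const/volatile removed, is one of a fixed set of
// single-byte types. The set itself is an assumption made for this sketch.
template <typename T>
constexpr auto is_byte_like_type()
{
  using non_cv_T = std::remove_cv_t<T>;
  return std::is_same_v<non_cv_T, std::int8_t> || std::is_same_v<non_cv_T, char> ||
         std::is_same_v<non_cv_T, std::uint8_t> ||
         std::is_same_v<non_cv_T, unsigned char> || std::is_same_v<non_cv_T, std::byte>;
}
}  // namespace sketch
```

Because the function is `constexpr`, it can gate templates at compile time, for example in a `static_assert` or an `if constexpr` branch that only accepts byte buffers.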