.. dask-cudf documentation coordinating file, created by
   sphinx-quickstart on Mon Feb  6 18:48:11 2023.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to dask-cudf's documentation!
=====================================

Dask-cuDF is an extension library for the `Dask <https://dask.org>`__
parallel computing framework that provides a `cuDF
<https://docs.rapids.ai/api/cudf/stable/>`__-backed distributed
dataframe with the same API as `Dask dataframes
<https://docs.dask.org/en/stable/dataframe.html>`__.

If you are familiar with Dask and `pandas <https://pandas.pydata.org>`__ or
`cuDF <https://docs.rapids.ai/api/cudf/stable/>`__, then Dask-cuDF
should feel familiar to you. If not, we recommend starting with `10
minutes to Dask
<https://docs.dask.org/en/stable/10-minutes-to-dask.html>`__ followed
by `10 minutes to cuDF and Dask-cuDF
<https://docs.rapids.ai/api/cudf/stable/user_guide/10min.html>`__.

When running on multi-GPU systems, we recommend `Dask-CUDA
<https://docs.rapids.ai/api/dask-cuda/stable/>`__ to simplify cluster
setup and to take full advantage of the available GPU and networking
hardware.
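
For example, a minimal sketch of a single-node setup might look like
this, assuming Dask-CUDA is installed::

   from dask.distributed import Client
   from dask_cuda import LocalCUDACluster

   # Launch one worker per GPU visible on this machine and attach a
   # client, so that subsequent work is scheduled on those workers.
   cluster = LocalCUDACluster()
   client = Client(cluster)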

Using Dask-cuDF
---------------

When installed, Dask-cuDF registers itself as a dataframe backend for
Dask. This means that in many cases, using cuDF-backed dataframes requires
only small changes to an existing workflow. The minimal change is to
select cuDF as the dataframe backend in :doc:`Dask's
configuration <dask:configuration>`. To do so, we must set the option
``dataframe.backend`` to ``cudf``. From Python, this can be achieved
like so::

  import dask

  dask.config.set({"dataframe.backend": "cudf"})
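
Note that ``dask.config.set`` can also be used as a context manager,
which restricts the change of backend to a limited scope.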

Alternatively, you can set ``DASK_DATAFRAME__BACKEND=cudf`` in the
environment before running your code.
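
For example, assuming a Bourne-style shell and a hypothetical script
``my_script.py``, you might run::

   DASK_DATAFRAME__BACKEND=cudf python my_script.py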

Dataframe creation from on-disk formats
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If your workflow creates Dask dataframes from on-disk formats
(for example using :func:`dask.dataframe.read_parquet`), then setting
the backend may well be enough to migrate your workflow.

For example, consider reading a dataframe from Parquet::

   import dask.dataframe as dd

   # By default, we obtain a pandas-backed dataframe
   df = dd.read_parquet("data.parquet", ...)


To obtain a cuDF-backed dataframe, we must set the
``dataframe.backend`` configuration option::

  import dask
  import dask.dataframe as dd

  dask.config.set({"dataframe.backend": "cudf"})
  # This gives us a cuDF-backed dataframe
  df = dd.read_parquet("data.parquet", ...)

This code will use cuDF's GPU-accelerated :func:`Parquet reader
<cudf.read_parquet>` to read partitions of the data.

Dataframe creation from in-memory formats
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you already have a dataframe in memory and want to convert it to a
cuDF-backed one, there are two options, depending on whether the
dataframe is already a Dask one. If you have a Dask dataframe, call
:func:`dask.dataframe.to_backend`, passing ``"cudf"`` as the backend;
if you have a pandas dataframe, either call
:func:`dask.dataframe.from_pandas` followed by
:func:`~dask.dataframe.to_backend`, or first convert the dataframe
with :func:`cudf.from_pandas` and then parallelise it with
:func:`dask_cudf.from_cudf`.
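
As a rough sketch, with a small hypothetical pandas dataframe standing
in for real data, the two routes look like this::

   import cudf
   import dask.dataframe as dd
   import pandas as pd

   import dask_cudf

   pdf = pd.DataFrame({"a": [1, 2, 3]})  # hypothetical example data

   # Route 1: parallelise with Dask first, then move to the cudf backend
   ddf = dd.from_pandas(pdf, npartitions=2).to_backend("cudf")

   # Route 2: convert to cuDF first, then parallelise with Dask-cuDF
   gdf = cudf.from_pandas(pdf)
   ddf = dask_cudf.from_cudf(gdf, npartitions=2)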

API Reference
-------------

Generally speaking, Dask-cuDF tries to offer exactly the same API as
Dask itself. There are, however, some minor differences, mostly
because cuDF does not :doc:`perfectly mirror
<cudf:user_guide/PandasCompat>` the pandas API, or because cuDF
provides additional configuration flags (these mostly occur in the
data reading and writing interfaces).

As a result, straightforward workflows can be migrated without too
much trouble, but more complex ones that utilise more features may
need a bit of tweaking. The API documentation describes details of the
differences and all functionality that Dask-cuDF supports.

.. toctree::
   :maxdepth: 2

   api


Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`