{ "cells": [ { "cell_type": "markdown", "id": "f8ffbea7", "metadata": {}, "source": [ "# Working with missing data" ] }, { "cell_type": "markdown", "id": "7e3ab093", "metadata": {}, "source": [ "In this section, we will discuss missing (also referred to as `NA`) values in cudf. cudf supports having missing values in all dtypes. These missing values are represented by `<NA>`. These values are also referenced as \"null values\"." ] }, { "cell_type": "markdown", "id": "8d657a82", "metadata": {}, "source": [ "## How to Detect missing values" ] }, { "cell_type": "markdown", "id": "9ea9f672", "metadata": {}, "source": [ "To detect missing values, you can use `isna()` and `notna()` functions." ] }, { "cell_type": "code", "execution_count": 1, "id": "58050adb", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "import cudf" ] }, { "cell_type": "code", "execution_count": 2, "id": "416d73da", "metadata": {}, "outputs": [], "source": [ "df = cudf.DataFrame({\"a\": [1, 2, None, 4], \"b\": [0.1, None, 2.3, 17.17]})" ] }, { "cell_type": "code", "execution_count": 3, "id": "5dfc6bc3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>0.1</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td><NA></td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td><NA></td>\n", " <td>2.3</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>4</td>\n", " <td>17.17</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b\n", 
"0 1 0.1\n", "1 2 <NA>\n", "2 <NA> 2.3\n", "3 4 17.17" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 4, "id": "4d7f7a6d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>False</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>False</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>True</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>False</td>\n", " <td>False</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b\n", "0 False False\n", "1 False True\n", "2 True False\n", "3 False False" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.isna()" ] }, { "cell_type": "code", "execution_count": 5, "id": "40edca67", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 True\n", "1 True\n", "2 False\n", "3 True\n", "Name: a, dtype: bool" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"a\"].notna()" ] }, { "cell_type": "markdown", "id": "acdf29d7", "metadata": {}, "source": [ "One has to be mindful that in Python (and NumPy), the nan's don't compare equal, but None's do. Note that cudf/NumPy uses the fact that `np.nan != np.nan`, and treats `None` like `np.nan`." 
] }, { "cell_type": "code", "execution_count": 6, "id": "c269c1f5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "None == None" ] }, { "cell_type": "code", "execution_count": 7, "id": "99fb083a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.nan == np.nan" ] }, { "cell_type": "markdown", "id": "4fdb8bc7", "metadata": {}, "source": [ "Consequently, a scalar equality comparison against `None` or `np.nan` doesn't provide useful information." ] }, { "cell_type": "code", "execution_count": 8, "id": "630ef6bb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 <NA>\n", "2 False\n", "3 False\n", "Name: b, dtype: bool" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"b\"] == np.nan" ] }, { "cell_type": "code", "execution_count": 9, "id": "8162e383", "metadata": {}, "outputs": [], "source": [ "s = cudf.Series([None, 1, 2])" ] }, { "cell_type": "code", "execution_count": 10, "id": "199775b3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 <NA>\n", "1 1\n", "2 2\n", "dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s" ] }, { "cell_type": "code", "execution_count": 11, "id": "cd09d80c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 <NA>\n", "1 <NA>\n", "2 <NA>\n", "dtype: bool" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s == None" ] }, { "cell_type": "code", "execution_count": 12, "id": "6b23bb0c", "metadata": {}, "outputs": [], "source": [ "s = cudf.Series([1, 2, np.nan], nan_as_null=False)" ] }, { "cell_type": "code", "execution_count": 13, "id": "cafb79ee", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "1 2.0\n", "2 
NaN\n", "dtype: float64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s" ] }, { "cell_type": "code", "execution_count": 14, "id": "13363897", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "dtype: bool" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s == np.nan" ] }, { "cell_type": "markdown", "id": "208a3776", "metadata": {}, "source": [ "## Float dtypes and missing data" ] }, { "cell_type": "markdown", "id": "2c174b88", "metadata": {}, "source": [ "Because ``NaN`` is a float, a column of integers with even one missing value would have to be cast to a floating-point dtype. However, this doesn't happen by default in cudf.\n", "\n", "By default, if a ``NaN`` value is passed to the `Series` constructor, it is treated as a `<NA>` value." ] }, { "cell_type": "code", "execution_count": 15, "id": "c59c3c54", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 <NA>\n", "dtype: int64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cudf.Series([1, 2, np.nan])" ] }, { "cell_type": "markdown", "id": "a9eb2d9c", "metadata": {}, "source": [ "Hence, to keep a ``NaN`` as ``NaN``, you have to pass the `nan_as_null=False` parameter to the `Series` constructor." ] }, { "cell_type": "code", "execution_count": 16, "id": "ecc5ae92", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "1 2.0\n", "2 NaN\n", "dtype: float64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cudf.Series([1, 2, np.nan], nan_as_null=False)" ] }, { "cell_type": "markdown", "id": "d1db7b08", "metadata": {}, "source": [ "## Datetimes" ] }, { "cell_type": "markdown", "id": "548d3734", "metadata": {}, "source": [ "For `datetime64` types, cudf doesn't support having `NaT` values. 
Instead, these values, which are specific to NumPy and pandas, are treated as null values (`<NA>`) in cudf. The actual underlying value of `NaT` is `min(int64)`, and cudf retains the underlying value when converting a cudf object to a pandas object." ] }, { "cell_type": "code", "execution_count": 17, "id": "de70f244", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 2012-01-01 00:00:00.000000\n", "1 <NA>\n", "2 2012-01-01 00:00:00.000000\n", "dtype: datetime64[us]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "datetime_series = cudf.Series(\n", " [pd.Timestamp(\"20120101\"), pd.NaT, pd.Timestamp(\"20120101\")]\n", ")\n", "datetime_series" ] }, { "cell_type": "code", "execution_count": 18, "id": "8411a914", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 2012-01-01\n", "1 NaT\n", "2 2012-01-01\n", "dtype: datetime64[ns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "datetime_series.to_pandas()" ] }, { "cell_type": "markdown", "id": "df664145", "metadata": {}, "source": [ "Any operation on rows having `<NA>` values in a `datetime` column results in a `<NA>` value at the same location in the resulting column:" ] }, { "cell_type": "code", "execution_count": 19, "id": "829c32d0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0 days 00:00:00\n", "1 <NA>\n", "2 0 days 00:00:00\n", "dtype: timedelta64[us]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "datetime_series - datetime_series" ] }, { "cell_type": "markdown", "id": "aa8031ef", "metadata": {}, "source": [ "## Calculations with missing data" ] }, { "cell_type": "markdown", "id": "c587fae2", "metadata": {}, "source": [ "Null values propagate naturally through arithmetic operations between cudf objects." 
] }, { "cell_type": "code", "execution_count": 20, "id": "f8f2aec7", "metadata": {}, "outputs": [], "source": [ "df1 = cudf.DataFrame(\n", " {\n", " \"a\": [1, None, 2, 3, None],\n", " \"b\": cudf.Series([np.nan, 2, 3.2, 0.1, 1], nan_as_null=False),\n", " }\n", ")" ] }, { "cell_type": "code", "execution_count": 21, "id": "0c8a3011", "metadata": {}, "outputs": [], "source": [ "df2 = cudf.DataFrame(\n", " {\"a\": [1, 11, 2, 34, 10], \"b\": cudf.Series([0.23, 22, 3.2, None, 1])}\n", ")" ] }, { "cell_type": "code", "execution_count": 22, "id": "052f6c2b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td><NA></td>\n", " <td>2.0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2</td>\n", " <td>3.2</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>3</td>\n", " <td>0.1</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td><NA></td>\n", " <td>1.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b\n", "0 1 NaN\n", "1 <NA> 2.0\n", "2 2 3.2\n", "3 3 0.1\n", "4 <NA> 1.0" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1" ] }, { "cell_type": "code", "execution_count": 23, "id": "0fb0a083", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " 
}\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>0.23</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>11</td>\n", " <td>22.0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2</td>\n", " <td>3.2</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>34</td>\n", " <td><NA></td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>10</td>\n", " <td>1.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b\n", "0 1 0.23\n", "1 11 22.0\n", "2 2 3.2\n", "3 34 <NA>\n", "4 10 1.0" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2" ] }, { "cell_type": "code", "execution_count": 24, "id": "6f8152c0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>2</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td><NA></td>\n", " <td>24.0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>4</td>\n", " <td>6.4</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>37</td>\n", " <td><NA></td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td><NA></td>\n", " <td>2.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b\n", "0 2 NaN\n", "1 <NA> 24.0\n", "2 4 6.4\n", "3 37 <NA>\n", "4 
<NA> 2.0" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 + df2" ] }, { "cell_type": "markdown", "id": "11170d49", "metadata": {}, "source": [ "While summing the data along a series, `NA` values are skipped by default, which is equivalent to treating them as `0`." ] }, { "cell_type": "code", "execution_count": 25, "id": "45081790", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 <NA>\n", "2 2\n", "3 3\n", "4 <NA>\n", "Name: a, dtype: int64" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1[\"a\"]" ] }, { "cell_type": "code", "execution_count": 26, "id": "39922658", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "6" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1[\"a\"].sum()" ] }, { "cell_type": "markdown", "id": "6e99afe0", "metadata": {}, "source": [ "Since `NA` values are skipped by default, the mean is computed over the non-null values only, which gives 2 in this case: `(1 + 2 + 3)/3 = 2`" ] }, { "cell_type": "code", "execution_count": 27, "id": "b2f16ddb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2.0" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1[\"a\"].mean()" ] }, { "cell_type": "markdown", "id": "07f2ec5a", "metadata": {}, "source": [ "To preserve `NA` values in the above calculations, `sum` and `mean` support the `skipna` parameter.\n", "By default, its value is\n", "set to `True`; we can change it to `False` to preserve `NA` values." 
] }, { "cell_type": "code", "execution_count": 28, "id": "d4a463a0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "nan" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1[\"a\"].sum(skipna=False)" ] }, { "cell_type": "code", "execution_count": 29, "id": "a944c42e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "nan" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1[\"a\"].mean(skipna=False)" ] }, { "cell_type": "markdown", "id": "fb8c8f18", "metadata": {}, "source": [ "Cumulative methods like `cumsum` and `cumprod` ignore `NA` values by default." ] }, { "cell_type": "code", "execution_count": 30, "id": "4f2a7306", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 <NA>\n", "2 3\n", "3 6\n", "4 <NA>\n", "Name: a, dtype: int64" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1[\"a\"].cumsum()" ] }, { "cell_type": "markdown", "id": "c8f6054b", "metadata": {}, "source": [ "To preserve `NA` values in cumulative methods, provide `skipna=False`." ] }, { "cell_type": "code", "execution_count": 31, "id": "d4c46776", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 <NA>\n", "2 <NA>\n", "3 <NA>\n", "4 <NA>\n", "Name: a, dtype: int64" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1[\"a\"].cumsum(skipna=False)" ] }, { "cell_type": "markdown", "id": "67077d65", "metadata": {}, "source": [ "## Sum/product of Null/nans" ] }, { "cell_type": "markdown", "id": "ffbb9ca1", "metadata": {}, "source": [ "The sum of an empty or all-NA Series of a DataFrame is 0." 
] }, { "cell_type": "code", "execution_count": 32, "id": "f430c9ce", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cudf.Series([np.nan], nan_as_null=False).sum()" ] }, { "cell_type": "code", "execution_count": 33, "id": "7fde514b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "nan" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cudf.Series([np.nan], nan_as_null=False).sum(skipna=False)" ] }, { "cell_type": "code", "execution_count": 34, "id": "56cedd17", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cudf.Series([], dtype=\"float64\").sum()" ] }, { "cell_type": "markdown", "id": "cb188adb", "metadata": {}, "source": [ "The product of an empty or all-NA Series of a DataFrame is 1." ] }, { "cell_type": "code", "execution_count": 35, "id": "d20bbbef", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cudf.Series([np.nan], nan_as_null=False).prod()" ] }, { "cell_type": "code", "execution_count": 36, "id": "75abbcfa", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "nan" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cudf.Series([np.nan], nan_as_null=False).prod(skipna=False)" ] }, { "cell_type": "code", "execution_count": 37, "id": "becce0cc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cudf.Series([], dtype=\"float64\").prod()" ] }, { "cell_type": "markdown", "id": "0e899e03", "metadata": {}, "source": [ "## NA values in GroupBy" ] }, { "cell_type": "markdown", "id": "7fb20874", "metadata": {}, "source": [ "`NA` groups in 
GroupBy are automatically excluded. For example:" ] }, { "cell_type": "code", "execution_count": 38, "id": "1379037c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td><NA></td>\n", " <td>2.0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2</td>\n", " <td>3.2</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>3</td>\n", " <td>0.1</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td><NA></td>\n", " <td>1.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b\n", "0 1 NaN\n", "1 <NA> 2.0\n", "2 2 3.2\n", "3 3 0.1\n", "4 <NA> 1.0" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1" ] }, { "cell_type": "code", "execution_count": 39, "id": "d6b91e6f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>b</th>\n", " </tr>\n", " <tr>\n", " <th>a</th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2</th>\n", " <td>3.2</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>NaN</td>\n", " </tr>\n", " 
<tr>\n", " <th>3</th>\n", " <td>0.1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " b\n", "a \n", "2 3.2\n", "1 NaN\n", "3 0.1" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.groupby(\"a\").mean()" ] }, { "cell_type": "markdown", "id": "cb83fb11", "metadata": {}, "source": [ "It is also possible to include `NA` in groups by passing `dropna=False`:" ] }, { "cell_type": "code", "execution_count": 40, "id": "768c3e50", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>b</th>\n", " </tr>\n", " <tr>\n", " <th>a</th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2</th>\n", " <td>3.2</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>0.1</td>\n", " </tr>\n", " <tr>\n", " <th><NA></th>\n", " <td>1.5</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " b\n", "a \n", "2 3.2\n", "1 NaN\n", "3 0.1\n", "<NA> 1.5" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.groupby(\"a\", dropna=False).mean()" ] }, { "cell_type": "markdown", "id": "133816b4", "metadata": {}, "source": [ "## Inserting missing data" ] }, { "cell_type": "markdown", "id": "306082ad", "metadata": {}, "source": [ "All dtypes support insertion of missing values by assignment. Any specific location in a series can be made null by assigning `None` to it." 
] }, { "cell_type": "code", "execution_count": 41, "id": "7ddde1fe", "metadata": {}, "outputs": [], "source": [ "series = cudf.Series([1, 2, 3, 4])" ] }, { "cell_type": "code", "execution_count": 42, "id": "16e54597", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "3 4\n", "dtype: int64" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "series" ] }, { "cell_type": "code", "execution_count": 43, "id": "f628f94d", "metadata": {}, "outputs": [], "source": [ "series[2] = None" ] }, { "cell_type": "code", "execution_count": 44, "id": "b30590b7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 <NA>\n", "3 4\n", "dtype: int64" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "series" ] }, { "cell_type": "markdown", "id": "a1b123d0", "metadata": {}, "source": [ "## Filling missing values: fillna" ] }, { "cell_type": "markdown", "id": "114aa23a", "metadata": {}, "source": [ "`fillna()` can fill in `NA` & `NaN` values with non-NA data." 
] }, { "cell_type": "code", "execution_count": 45, "id": "59e22668", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td><NA></td>\n", " <td>2.0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2</td>\n", " <td>3.2</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>3</td>\n", " <td>0.1</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td><NA></td>\n", " <td>1.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b\n", "0 1 NaN\n", "1 <NA> 2.0\n", "2 2 3.2\n", "3 3 0.1\n", "4 <NA> 1.0" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1" ] }, { "cell_type": "code", "execution_count": 46, "id": "05c221ee", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 10.0\n", "1 2.0\n", "2 3.2\n", "3 0.1\n", "4 1.0\n", "Name: b, dtype: float64" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1[\"b\"].fillna(10)" ] }, { "cell_type": "markdown", "id": "401f91b2", "metadata": {}, "source": [ "## Filling with cudf Object" ] }, { "cell_type": "markdown", "id": "e79346d6", "metadata": {}, "source": [ "You can also fillna using a dict or Series that is alignable. The labels of the dict or index of the Series must match the columns of the frame you wish to fill. The use case of this is to fill a DataFrame with the mean of that column." 
] }, { "cell_type": "code", "execution_count": 47, "id": "f52c5d8f", "metadata": {}, "outputs": [], "source": [ "import cupy as cp\n", "\n", "dff = cudf.DataFrame(cp.random.randn(10, 3), columns=list(\"ABC\"))" ] }, { "cell_type": "code", "execution_count": 48, "id": "6affebe9", "metadata": {}, "outputs": [], "source": [ "dff.iloc[3:5, 0] = np.nan" ] }, { "cell_type": "code", "execution_count": 49, "id": "1ce1b96f", "metadata": {}, "outputs": [], "source": [ "dff.iloc[4:6, 1] = np.nan" ] }, { "cell_type": "code", "execution_count": 50, "id": "90829195", "metadata": {}, "outputs": [], "source": [ "dff.iloc[5:8, 2] = np.nan" ] }, { "cell_type": "code", "execution_count": 51, "id": "c0feac14", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>A</th>\n", " <th>B</th>\n", " <th>C</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>-0.408268</td>\n", " <td>-0.676643</td>\n", " <td>-1.274743</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>-0.029322</td>\n", " <td>-0.873593</td>\n", " <td>-1.214105</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>-0.866371</td>\n", " <td>1.081735</td>\n", " <td>-0.226840</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>NaN</td>\n", " <td>0.812278</td>\n", " <td>1.074973</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>-0.366725</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>-1.016239</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>0.675123</td>\n", " <td>1.067536</td>\n", " <td>NaN</td>\n", " </tr>\n", " 
<tr>\n", " <th>7</th>\n", " <td>0.221568</td>\n", " <td>2.025961</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>-0.317241</td>\n", " <td>1.011275</td>\n", " <td>0.674891</td>\n", " </tr>\n", " <tr>\n", " <th>9</th>\n", " <td>-0.877041</td>\n", " <td>-1.919394</td>\n", " <td>-1.029201</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " A B C\n", "0 -0.408268 -0.676643 -1.274743\n", "1 -0.029322 -0.873593 -1.214105\n", "2 -0.866371 1.081735 -0.226840\n", "3 NaN 0.812278 1.074973\n", "4 NaN NaN -0.366725\n", "5 -1.016239 NaN NaN\n", "6 0.675123 1.067536 NaN\n", "7 0.221568 2.025961 NaN\n", "8 -0.317241 1.011275 0.674891\n", "9 -0.877041 -1.919394 -1.029201" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dff" ] }, { "cell_type": "code", "execution_count": 52, "id": "a07c1260", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>A</th>\n", " <th>B</th>\n", " <th>C</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>-0.408268</td>\n", " <td>-0.676643</td>\n", " <td>-1.274743</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>-0.029322</td>\n", " <td>-0.873593</td>\n", " <td>-1.214105</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>-0.866371</td>\n", " <td>1.081735</td>\n", " <td>-0.226840</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>-0.327224</td>\n", " <td>0.812278</td>\n", " <td>1.074973</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>-0.327224</td>\n", " <td>0.316145</td>\n", " <td>-0.366725</td>\n", " </tr>\n", " <tr>\n", " 
<th>5</th>\n", " <td>-1.016239</td>\n", " <td>0.316145</td>\n", " <td>-0.337393</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>0.675123</td>\n", " <td>1.067536</td>\n", " <td>-0.337393</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>0.221568</td>\n", " <td>2.025961</td>\n", " <td>-0.337393</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>-0.317241</td>\n", " <td>1.011275</td>\n", " <td>0.674891</td>\n", " </tr>\n", " <tr>\n", " <th>9</th>\n", " <td>-0.877041</td>\n", " <td>-1.919394</td>\n", " <td>-1.029201</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " A B C\n", "0 -0.408268 -0.676643 -1.274743\n", "1 -0.029322 -0.873593 -1.214105\n", "2 -0.866371 1.081735 -0.226840\n", "3 -0.327224 0.812278 1.074973\n", "4 -0.327224 0.316145 -0.366725\n", "5 -1.016239 0.316145 -0.337393\n", "6 0.675123 1.067536 -0.337393\n", "7 0.221568 2.025961 -0.337393\n", "8 -0.317241 1.011275 0.674891\n", "9 -0.877041 -1.919394 -1.029201" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dff.fillna(dff.mean())" ] }, { "cell_type": "code", "execution_count": 53, "id": "9e70d61a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>A</th>\n", " <th>B</th>\n", " <th>C</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>-0.408268</td>\n", " <td>-0.676643</td>\n", " <td>-1.274743</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>-0.029322</td>\n", " <td>-0.873593</td>\n", " <td>-1.214105</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>-0.866371</td>\n", " 
<td>1.081735</td>\n", " <td>-0.226840</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>NaN</td>\n", " <td>0.812278</td>\n", " <td>1.074973</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>NaN</td>\n", " <td>0.316145</td>\n", " <td>-0.366725</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>-1.016239</td>\n", " <td>0.316145</td>\n", " <td>-0.337393</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>0.675123</td>\n", " <td>1.067536</td>\n", " <td>-0.337393</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>0.221568</td>\n", " <td>2.025961</td>\n", " <td>-0.337393</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>-0.317241</td>\n", " <td>1.011275</td>\n", " <td>0.674891</td>\n", " </tr>\n", " <tr>\n", " <th>9</th>\n", " <td>-0.877041</td>\n", " <td>-1.919394</td>\n", " <td>-1.029201</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " A B C\n", "0 -0.408268 -0.676643 -1.274743\n", "1 -0.029322 -0.873593 -1.214105\n", "2 -0.866371 1.081735 -0.226840\n", "3 NaN 0.812278 1.074973\n", "4 NaN 0.316145 -0.366725\n", "5 -1.016239 0.316145 -0.337393\n", "6 0.675123 1.067536 -0.337393\n", "7 0.221568 2.025961 -0.337393\n", "8 -0.317241 1.011275 0.674891\n", "9 -0.877041 -1.919394 -1.029201" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dff.fillna(dff.mean()[1:3])" ] }, { "cell_type": "markdown", "id": "0ace728d", "metadata": {}, "source": [ "## Dropping axis labels with missing data: dropna" ] }, { "cell_type": "markdown", "id": "2ccd7115", "metadata": {}, "source": [ "Missing data can be excluded using `dropna()`:" ] }, { "cell_type": "code", "execution_count": 54, "id": "98c57be7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: 
right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td><NA></td>\n", " <td>2.0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2</td>\n", " <td>3.2</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>3</td>\n", " <td>0.1</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td><NA></td>\n", " <td>1.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b\n", "0 1 NaN\n", "1 <NA> 2.0\n", "2 2 3.2\n", "3 3 0.1\n", "4 <NA> 1.0" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1" ] }, { "cell_type": "code", "execution_count": 55, "id": "bc3f273a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2</th>\n", " <td>2</td>\n", " <td>3.2</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>3</td>\n", " <td>0.1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b\n", "2 2 3.2\n", "3 3 0.1" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.dropna(axis=0)" ] }, { "cell_type": "code", "execution_count": 56, "id": "a48d4de0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: 
middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ "Empty DataFrame\n", "Columns: []\n", "Index: [0, 1, 2, 3, 4]" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.dropna(axis=1)" ] }, { "cell_type": "markdown", "id": "0b1954f9", "metadata": {}, "source": [ "An equivalent `dropna()` is available for Series." ] }, { "cell_type": "code", "execution_count": 57, "id": "2dd8f660", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "2 2\n", "3 3\n", "Name: a, dtype: int64" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1[\"a\"].dropna()" ] }, { "cell_type": "markdown", "id": "121eb6d7", "metadata": {}, "source": [ "## Replacing generic values" ] }, { "cell_type": "markdown", "id": "3cc4c5f1", "metadata": {}, "source": [ "Often we want to replace arbitrary values with other values.\n", "\n", "`Series.replace()` and `DataFrame.replace()` provide an efficient yet flexible way to perform such replacements."
] }, { "cell_type": "code", "execution_count": 58, "id": "e6c14e8a", "metadata": {}, "outputs": [], "source": [ "series = cudf.Series([0.0, 1.0, 2.0, 3.0, 4.0])" ] }, { "cell_type": "code", "execution_count": 59, "id": "a852f0cb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.0\n", "1 1.0\n", "2 2.0\n", "3 3.0\n", "4 4.0\n", "dtype: float64" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "series" ] }, { "cell_type": "code", "execution_count": 60, "id": "f6ac12eb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 5.0\n", "1 1.0\n", "2 2.0\n", "3 3.0\n", "4 4.0\n", "dtype: float64" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "series.replace(0, 5)" ] }, { "cell_type": "markdown", "id": "a6e1b6d7", "metadata": {}, "source": [ "We can also replace any value with a `<NA>` value." ] }, { "cell_type": "code", "execution_count": 61, "id": "f0156bff", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 <NA>\n", "1 1.0\n", "2 2.0\n", "3 3.0\n", "4 4.0\n", "dtype: float64" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "series.replace(0, None)" ] }, { "cell_type": "markdown", "id": "6673eefb", "metadata": {}, "source": [ "You can replace a list of values by a list of other values:" ] }, { "cell_type": "code", "execution_count": 62, "id": "f3110f5b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 4.0\n", "1 3.0\n", "2 2.0\n", "3 1.0\n", "4 0.0\n", "dtype: float64" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "series.replace([0, 1, 2, 3, 4], [4, 3, 2, 1, 0])" ] }, { "cell_type": "markdown", "id": "61521e8b", "metadata": {}, "source": [ "You can also specify a mapping dict:" ] }, { "cell_type": "code", "execution_count": 63, "id": "45862d05", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 10.0\n", "1 100.0\n", "2 2.0\n", "3 3.0\n", "4 
4.0\n", "dtype: float64" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "series.replace({0: 10, 1: 100})" ] }, { "cell_type": "markdown", "id": "04a34549", "metadata": {}, "source": [ "For a DataFrame, you can specify individual values by column:" ] }, { "cell_type": "code", "execution_count": 64, "id": "348caa64", "metadata": {}, "outputs": [], "source": [ "df = cudf.DataFrame({\"a\": [0, 1, 2, 3, 4], \"b\": [5, 6, 7, 8, 9]})" ] }, { "cell_type": "code", "execution_count": 65, "id": "cca41ec4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>5</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1</td>\n", " <td>6</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2</td>\n", " <td>7</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>3</td>\n", " <td>8</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>4</td>\n", " <td>9</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b\n", "0 0 5\n", "1 1 6\n", "2 2 7\n", "3 3 8\n", "4 4 9" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 66, "id": "64334693", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " 
text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>100</td>\n", " <td>100</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1</td>\n", " <td>6</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2</td>\n", " <td>7</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>3</td>\n", " <td>8</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>4</td>\n", " <td>9</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b\n", "0 100 100\n", "1 1 6\n", "2 2 7\n", "3 3 8\n", "4 4 9" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.replace({\"a\": 0, \"b\": 5}, 100)" ] }, { "cell_type": "markdown", "id": "2f0ceec7", "metadata": {}, "source": [ "## String/regular expression replacement" ] }, { "cell_type": "markdown", "id": "c6f44740", "metadata": {}, "source": [ "cudf supports replacing string values using the `replace()` API:" ] }, { "cell_type": "code", "execution_count": 67, "id": "031d3533", "metadata": {}, "outputs": [], "source": [ "d = {\"a\": list(range(4)), \"b\": list(\"ab..\"), \"c\": [\"a\", \"b\", None, \"d\"]}" ] }, { "cell_type": "code", "execution_count": 68, "id": "12b41efb", "metadata": {}, "outputs": [], "source": [ "df = cudf.DataFrame(d)" ] }, { "cell_type": "code", "execution_count": 69, "id": "d450df49", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " 
<th>b</th>\n", " <th>c</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>a</td>\n", " <td>a</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1</td>\n", " <td>b</td>\n", " <td>b</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2</td>\n", " <td>.</td>\n", " <td><NA></td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>3</td>\n", " <td>.</td>\n", " <td>d</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c\n", "0 0 a a\n", "1 1 b b\n", "2 2 . <NA>\n", "3 3 . d" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 70, "id": "f823bc46", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>a</td>\n", " <td>a</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1</td>\n", " <td>b</td>\n", " <td>b</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2</td>\n", " <td>A Dot</td>\n", " <td><NA></td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>3</td>\n", " <td>A Dot</td>\n", " <td>d</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c\n", "0 0 a a\n", "1 1 b b\n", "2 2 A Dot <NA>\n", "3 3 A Dot d" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.replace(\".\", \"A Dot\")" ] }, { "cell_type": "code", "execution_count": 71, "id": "bc52f6e9", "metadata": {}, "outputs": [ { "data": { 
"text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>a</td>\n", " <td>a</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1</td>\n", " <td><NA></td>\n", " <td><NA></td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2</td>\n", " <td>A Dot</td>\n", " <td><NA></td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>3</td>\n", " <td>A Dot</td>\n", " <td>d</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c\n", "0 0 a a\n", "1 1 <NA> <NA>\n", "2 2 A Dot <NA>\n", "3 3 A Dot d" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.replace([\".\", \"b\"], [\"A Dot\", None])" ] }, { "cell_type": "markdown", "id": "7c1087be", "metadata": {}, "source": [ "Replace a few different values (list -> list):" ] }, { "cell_type": "code", "execution_count": 72, "id": "7e23eba9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>b</td>\n", " <td>b</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " 
<td>1</td>\n", " <td>b</td>\n", " <td>b</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2</td>\n", " <td>--</td>\n", " <td><NA></td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>3</td>\n", " <td>--</td>\n", " <td>d</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c\n", "0 0 b b\n", "1 1 b b\n", "2 2 -- <NA>\n", "3 3 -- d" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.replace([\"a\", \".\"], [\"b\", \"--\"])" ] }, { "cell_type": "markdown", "id": "42845a9c", "metadata": {}, "source": [ "Only search in column 'b' (dict -> dict):" ] }, { "cell_type": "code", "execution_count": 73, "id": "d2e79805", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>a</td>\n", " <td>a</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1</td>\n", " <td>b</td>\n", " <td>b</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2</td>\n", " <td>replacement value</td>\n", " <td><NA></td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>3</td>\n", " <td>replacement value</td>\n", " <td>d</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c\n", "0 0 a a\n", "1 1 b b\n", "2 2 replacement value <NA>\n", "3 3 replacement value d" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.replace({\"b\": \".\"}, {\"b\": \"replacement value\"})" ] }, { "cell_type": "markdown", "id": "774b42a6", "metadata": 
{}, "source": [ "## Numeric replacement" ] }, { "cell_type": "markdown", "id": "1c1926ac", "metadata": {}, "source": [ "`replace()` can also be used in a similar way to `fillna()`." ] }, { "cell_type": "code", "execution_count": 74, "id": "355a2f0d", "metadata": {}, "outputs": [], "source": [ "df = cudf.DataFrame(cp.random.randn(10, 2))" ] }, { "cell_type": "code", "execution_count": 75, "id": "d9eed372", "metadata": {}, "outputs": [], "source": [ "df[np.random.rand(df.shape[0]) > 0.5] = 1.5" ] }, { "cell_type": "code", "execution_count": 76, "id": "ae944244", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>-0.089358787</td>\n", " <td>-0.728419386</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>-2.141612003</td>\n", " <td>-0.574415182</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td><NA></td>\n", " <td><NA></td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>0.774643462</td>\n", " <td>2.07287721</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>0.93799853</td>\n", " <td>-1.054129436</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td><NA></td>\n", " <td><NA></td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>-0.435293012</td>\n", " <td>1.163009584</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>1.346623287</td>\n", " <td>0.31961371</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td><NA></td>\n", " <td><NA></td>\n", " </tr>\n", " <tr>\n", " <th>9</th>\n", " <td><NA></td>\n", " <td><NA></td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], 
"text/plain": [ " 0 1\n", "0 -0.089358787 -0.728419386\n", "1 -2.141612003 -0.574415182\n", "2 <NA> <NA>\n", "3 0.774643462 2.07287721\n", "4 0.93799853 -1.054129436\n", "5 <NA> <NA>\n", "6 -0.435293012 1.163009584\n", "7 1.346623287 0.31961371\n", "8 <NA> <NA>\n", "9 <NA> <NA>" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.replace(1.5, None)" ] }, { "cell_type": "markdown", "id": "0f32607c", "metadata": {}, "source": [ "Replacing more than one value is possible by passing a list." ] }, { "cell_type": "code", "execution_count": 77, "id": "59b81c60", "metadata": {}, "outputs": [], "source": [ "df00 = df.iloc[0, 0]" ] }, { "cell_type": "code", "execution_count": 78, "id": "01a71d4c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>10.000000</td>\n", " <td>-0.728419</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>-2.141612</td>\n", " <td>-0.574415</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>5.000000</td>\n", " <td>5.000000</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>0.774643</td>\n", " <td>2.072877</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>0.937999</td>\n", " <td>-1.054129</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>5.000000</td>\n", " <td>5.000000</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>-0.435293</td>\n", " <td>1.163010</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>1.346623</td>\n", " <td>0.319614</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " 
<td>5.000000</td>\n", " <td>5.000000</td>\n", " </tr>\n", " <tr>\n", " <th>9</th>\n", " <td>5.000000</td>\n", " <td>5.000000</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1\n", "0 10.000000 -0.728419\n", "1 -2.141612 -0.574415\n", "2 5.000000 5.000000\n", "3 0.774643 2.072877\n", "4 0.937999 -1.054129\n", "5 5.000000 5.000000\n", "6 -0.435293 1.163010\n", "7 1.346623 0.319614\n", "8 5.000000 5.000000\n", "9 5.000000 5.000000" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.replace([1.5, df00], [5, 10])" ] }, { "cell_type": "markdown", "id": "1080e97b", "metadata": {}, "source": [ "You can also operate on the DataFrame in place:" ] }, { "cell_type": "code", "execution_count": 79, "id": "5f0859d7", "metadata": {}, "outputs": [], "source": [ "df.replace(1.5, None, inplace=True)" ] }, { "cell_type": "code", "execution_count": 80, "id": "5cf28369", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>-0.089358787</td>\n", " <td>-0.728419386</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>-2.141612003</td>\n", " <td>-0.574415182</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td><NA></td>\n", " <td><NA></td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>0.774643462</td>\n", " <td>2.07287721</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>0.93799853</td>\n", " <td>-1.054129436</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td><NA></td>\n", " <td><NA></td>\n", " </tr>\n", " 
<tr>\n", " <th>6</th>\n", " <td>-0.435293012</td>\n", " <td>1.163009584</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>1.346623287</td>\n", " <td>0.31961371</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td><NA></td>\n", " <td><NA></td>\n", " </tr>\n", " <tr>\n", " <th>9</th>\n", " <td><NA></td>\n", " <td><NA></td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1\n", "0 -0.089358787 -0.728419386\n", "1 -2.141612003 -0.574415182\n", "2 <NA> <NA>\n", "3 0.774643462 2.07287721\n", "4 0.93799853 -1.054129436\n", "5 <NA> <NA>\n", "6 -0.435293012 1.163009584\n", "7 1.346623287 0.31961371\n", "8 <NA> <NA>\n", "9 <NA> <NA>" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" } }, "nbformat": 4, "nbformat_minor": 5 }