{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Training and Evaluating Machine Learning Models"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook explores several basic machine learning estimators in cuML, demonstrating how to train each one and evaluate it with cuML's built-in metric functions, cross-checked against their sklearn counterparts. All of the models are trained on synthetic data generated by cuML's dataset utilities. The estimators covered are:\n",
    "\n",
    "1. Random Forest Classifier\n",
    "2. UMAP\n",
    "3. DBSCAN\n",
    "4. Linear Regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Shared Library Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import cuml\n",
    "from cupy import asnumpy \n",
    "from joblib import dump, load"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Classification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Random Forest Classification and Accuracy metrics\n",
    "\n",
    "The Random Forest classification model builds several decision trees and aggregates their outputs to make a prediction. For more information on cuML's implementation of the Random Forest Classification model, please refer to:\n",
    "https://docs.rapids.ai/api/cuml/stable/api.html#cuml.ensemble.RandomForestClassifier\n",
    "\n",
    "The accuracy score is the ratio of correct predictions to the total number of predictions. It is used to measure the performance of classification models.\n",
    "For more information on the accuracy score metric, please refer to: https://en.wikipedia.org/wiki/Accuracy_and_precision\n",
    "\n",
    "For more information on cuML's implementation of the accuracy score metric, please refer to: https://docs.rapids.ai/api/cuml/stable/api.html#cuml.metrics.accuracy.accuracy_score\n",
    "\n",
    "The cell below shows an end-to-end pipeline for the Random Forest Classification model. The dataset is generated with cuML's make_classification function and split into training and test sets. The model is trained on the training set and then used to predict labels for the test set. Its performance is evaluated with both the cuML and sklearn accuracy metrics, and the two values are compared."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from cuml.datasets.classification import make_classification\n",
    "from cuml.model_selection import train_test_split\n",
    "from cuml.ensemble import RandomForestClassifier as cuRF\n",
    "from sklearn.metrics import accuracy_score\n",
    "\n",
    "# synthetic dataset dimensions\n",
    "n_samples = 1000\n",
    "n_features = 10\n",
    "n_classes = 2\n",
    "\n",
    "# random forest depth and size\n",
    "n_estimators = 25\n",
    "max_depth = 10\n",
    "\n",
    "# generate synthetic data [ binary classification task ]\n",
    "X, y = make_classification( n_classes = n_classes,\n",
    "                            n_features = n_features,\n",
    "                            n_samples = n_samples,\n",
    "                            random_state = 0 )\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split( X, y, random_state = 0 )\n",
    "\n",
    "model = cuRF( max_depth = max_depth,\n",
    "              n_estimators = n_estimators,\n",
    "              random_state = 0 )\n",
    "\n",
    "# fit returns the fitted estimator itself\n",
    "trained_RF = model.fit( X_train, y_train )\n",
    "\n",
    "predictions = model.predict( X_test )\n",
    "\n",
    "# score with cuML on the GPU, then with sklearn on host copies of the same arrays\n",
    "cu_score = cuml.metrics.accuracy_score( y_test, predictions )\n",
    "sk_score = accuracy_score( asnumpy( y_test ), asnumpy( predictions ) )\n",
    "\n",
    "print( \" cuml accuracy: \", cu_score )\n",
    "print( \" sklearn accuracy : \", sk_score )\n",
    "\n",
    "# save\n",
    "dump( trained_RF, 'RF.model')\n",
    "\n",
    "# to reload the model uncomment the line below\n",
    "# loaded_model = load('RF.model')"
   ]
  },
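  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sanity check on the accuracy definition given earlier in this section, the sketch below computes the same ratio by hand in plain Python, with no GPU required. The helper `manual_accuracy` and the toy labels are hypothetical names made up for this illustration; they are not part of cuML or sklearn."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def manual_accuracy(y_true, y_pred):\n",
    "    # fraction of positions where the predicted label equals the true label\n",
    "    if len(y_true) != len(y_pred):\n",
    "        raise ValueError('y_true and y_pred must have the same length')\n",
    "    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)\n",
    "    return correct / len(y_true)\n",
    "\n",
    "# toy labels (illustrative only): 4 of the 5 predictions match\n",
    "toy_true = [0, 1, 1, 0, 1]\n",
    "toy_pred = [0, 1, 0, 0, 1]\n",
    "print('manual accuracy:', manual_accuracy(toy_true, toy_pred))"
   ]
  },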
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dimensionality Reduction and Clustering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### UMAP and Trustworthiness metrics\n",
    "UMAP is an algorithm that performs non-linear dimensionality reduction. It can also be used for visualization.\n",
    "For additional information on the UMAP model, please refer to the documentation at https://docs.rapids.ai/api/cuml/stable/api.html#cuml.UMAP\n",
    "\n",
    "Trustworthiness measures the extent to which the local structure of the original data is retained in the embedding. A sample is penalized when it appears among the nearest neighbors of a point in the embedding but was not among that point's nearest neighbors in the original space. For more information on the trustworthiness metric, please refer to: https://scikit-learn.org/dev/modules/generated/sklearn.manifold.t_sne.trustworthiness.html\n",
    "\n",
    "The documentation for cuML's implementation of the trustworthiness metric is at: https://docs.rapids.ai/api/cuml/stable/api.html#cuml.metrics.trustworthiness.trustworthiness\n",
    "\n",
    "The cell below shows an end-to-end pipeline for the UMAP model. The blobs dataset used as input is created with cuML's equivalent of the make_blobs function. The model is fitted with fit, the embedding is produced with transform, and the embedding is evaluated using the trustworthiness function. The values obtained from sklearn's and cuML's trustworthiness functions are then compared."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from cuml.datasets import make_blobs\n",
    "from cuml.manifold.umap import UMAP as cuUMAP\n",
    "from sklearn.manifold import trustworthiness\n",
    "import numpy as np\n",
    "\n",
    "n_samples = 1000\n",
    "n_features = 100\n",
    "cluster_std = 0.1\n",
    "\n",
    "X_blobs, y_blobs = make_blobs( n_samples = n_samples,\n",
    "                               cluster_std = cluster_std,\n",
    "                               n_features = n_features,\n",
    "                               random_state = 0,\n",
    "                               dtype=np.float32 )\n",
    "\n",
    "trained_UMAP = cuUMAP( n_neighbors = 10 ).fit( X_blobs )\n",
    "X_embedded = trained_UMAP.transform( X_blobs )\n",
    "                                            \n",
    "cu_score = cuml.metrics.trustworthiness( X_blobs, X_embedded )\n",
    "sk_score = trustworthiness( asnumpy( X_blobs ),  asnumpy( X_embedded ) )\n",
    "\n",
    "print(\" cuml's trustworthiness score : \", cu_score )\n",
    "print(\" sklearn's trustworthiness score : \", sk_score )\n",
    "\n",
    "# save\n",
    "dump( trained_UMAP, 'UMAP.model')\n",
    "\n",
    "# to reload the model uncomment the line below \n",
    "# loaded_model = load('UMAP.model')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### DBSCAN and Adjusted Rand Index\n",
    "DBSCAN is a popular and powerful density-based clustering algorithm. For additional information on the DBSCAN model, please refer to the documentation at https://docs.rapids.ai/api/cuml/stable/api.html#cuml.DBSCAN\n",
    "\n",
    "We create the blobs dataset using the cuML equivalent of the make_blobs function.\n",
    "\n",
    "The adjusted Rand index is a metric that measures the similarity between two clusterings of the same data, adjusted for the chance grouping of elements.\n",
    "For more information on the adjusted Rand index, please refer to: https://en.wikipedia.org/wiki/Rand_index\n",
    "\n",
    "The cell below shows an end-to-end pipeline for DBSCAN. The output of DBSCAN's fit_predict is evaluated using the adjusted Rand index function. The values obtained from sklearn's and cuML's adjusted Rand index functions are then compared."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from cuml.datasets import make_blobs\n",
    "from cuml import DBSCAN as cumlDBSCAN\n",
    "from sklearn.metrics import adjusted_rand_score\n",
    "import numpy as np\n",
    "\n",
    "n_samples = 1000\n",
    "n_features = 100\n",
    "cluster_std = 0.1\n",
    "\n",
    "X_blobs, y_blobs = make_blobs( n_samples = n_samples,\n",
    "                               n_features = n_features,\n",
    "                               cluster_std = cluster_std,\n",
    "                               random_state = 0,\n",
    "                               dtype=np.float32 )\n",
    "\n",
    "cuml_dbscan = cumlDBSCAN( eps = 3,\n",
    "                          min_samples = 2 )\n",
    "\n",
    "# fit_predict fits the model and returns the cluster labels in a single call,\n",
    "# so a separate fit beforehand is not needed\n",
    "cu_y_pred = cuml_dbscan.fit_predict( X_blobs )\n",
    "\n",
    "# keep a handle to the fitted estimator for saving\n",
    "trained_DBSCAN = cuml_dbscan\n",
    "\n",
    "cu_adjusted_rand_index = cuml.metrics.cluster.adjusted_rand_score( y_blobs, cu_y_pred )\n",
    "sk_adjusted_rand_index = adjusted_rand_score( asnumpy(y_blobs), asnumpy(cu_y_pred) )\n",
    "\n",
    "print(\" cuml's adjusted Rand index score : \", cu_adjusted_rand_index)\n",
    "print(\" sklearn's adjusted Rand index score : \", sk_adjusted_rand_index)\n",
    "\n",
    "# save and optionally reload\n",
    "dump( trained_DBSCAN, 'DBSCAN.model')\n",
    "\n",
    "# to reload the model uncomment the line below\n",
    "# loaded_model = load('DBSCAN.model')"
   ]
  },
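  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the chance adjustment described in this section concrete, here is a minimal pure-Python sketch of the pair-counting form of the index, assuming both labelings are plain lists of hashable labels. The helpers `n_pairs` and `manual_adjusted_rand` are hypothetical names written for this illustration; this is not how cuML or sklearn implement the metric internally."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "\n",
    "def n_pairs(count):\n",
    "    # number of unordered pairs that can be drawn from `count` items\n",
    "    return count * (count - 1) // 2\n",
    "\n",
    "def manual_adjusted_rand(labels_a, labels_b):\n",
    "    n = len(labels_a)\n",
    "    if n != len(labels_b):\n",
    "        raise ValueError('both labelings must have the same length')\n",
    "    if n < 2:\n",
    "        # no pairs to compare; treat as perfect agreement\n",
    "        return 1.0\n",
    "    # pairs grouped together in both labelings, in labels_a, and in labels_b\n",
    "    together_both = sum(n_pairs(c) for c in Counter(zip(labels_a, labels_b)).values())\n",
    "    together_a = sum(n_pairs(c) for c in Counter(labels_a).values())\n",
    "    together_b = sum(n_pairs(c) for c in Counter(labels_b).values())\n",
    "    # agreement expected from random labelings with the same cluster sizes\n",
    "    expected = together_a * together_b / n_pairs(n)\n",
    "    max_index = (together_a + together_b) / 2\n",
    "    if max_index == expected:\n",
    "        # degenerate case, e.g. both labelings use a single cluster\n",
    "        return 1.0\n",
    "    return (together_both - expected) / (max_index - expected)\n",
    "\n",
    "# the same grouping under different label names is a perfect match\n",
    "print('renamed labels :', manual_adjusted_rand([0, 0, 1, 1], [5, 5, 7, 7]))\n",
    "# a grouping that splits every true pair scores below zero\n",
    "print('crossed labels :', manual_adjusted_rand([0, 0, 1, 1], [0, 1, 0, 1]))"
   ]
  },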
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Linear Regression and R^2 score\n",
    "Linear Regression is a simple machine learning model where the response y is modelled by a linear combination of the predictors in X.\n",
    "\n",
    "The R^2 score, also known as the coefficient of determination, is a metric for scoring regression models. It measures the proportion of the total variation in the response that is explained by the model.\n",
    "For more information on the R^2 score metric, please refer to: https://en.wikipedia.org/wiki/Coefficient_of_determination\n",
    "\n",
    "For more information on cuML's implementation of the r2 score metric, please refer to: https://docs.rapids.ai/api/cuml/stable/api.html#cuml.metrics.regression.r2_score\n",
    "\n",
    "The cell below trains a Linear Regression model and compares the R^2 scores computed by cuML and sklearn. For more information on cuML's implementation of the Linear Regression model, please refer to:\n",
    "https://docs.rapids.ai/api/cuml/stable/api.html#linear-regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from cuml.datasets import make_regression\n",
    "from cuml.model_selection import train_test_split\n",
    "from cuml.linear_model import LinearRegression as cuLR\n",
    "from sklearn.metrics import r2_score\n",
    "\n",
    "n_samples = 2**10\n",
    "n_features = 100\n",
    "n_info = 70\n",
    "\n",
    "X_reg, y_reg = make_regression( n_samples = n_samples, \n",
    "                                n_features = n_features,\n",
    "                                n_informative = n_info, \n",
    "                                random_state = 123 )\n",
    "\n",
    "X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split( X_reg,\n",
    "                                                                     y_reg, \n",
    "                                                                     train_size = 0.8,\n",
    "                                                                     random_state = 10 )\n",
    "cuml_reg_model = cuLR( fit_intercept = True,\n",
    "                       normalize = True,\n",
    "                       algorithm = 'eig' )\n",
    "\n",
    "trained_LR = cuml_reg_model.fit( X_reg_train, y_reg_train )\n",
    "cu_preds = trained_LR.predict( X_reg_test )\n",
    "\n",
    "cu_r2 = cuml.metrics.r2_score( y_reg_test, cu_preds )\n",
    "sk_r2 = r2_score( asnumpy( y_reg_test ), asnumpy( cu_preds ) )\n",
    "\n",
    "print(\"cuml's r2 score : \", cu_r2)\n",
    "print(\"sklearn's r2 score : \", sk_r2)\n",
    "\n",
    "# save and reload \n",
    "dump( trained_LR, 'LR.model')         \n",
    "\n",
    "# to reload the model uncomment the line below \n",
    "# loaded_model = load('LR.model')"
   ]
  }
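  ,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The R^2 definition used in this section can also be checked by hand. The sketch below computes 1 - SS_res / SS_tot in plain Python, assuming the true targets are not all identical (otherwise SS_tot is zero and the score is undefined). The helper `manual_r2` and the toy values are hypothetical and exist only for this illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def manual_r2(y_true, y_pred):\n",
    "    # residual sum of squares: variation the predictions fail to explain\n",
    "    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))\n",
    "    # total sum of squares: variation of the targets around their mean\n",
    "    mean_true = sum(y_true) / len(y_true)\n",
    "    ss_tot = sum((t - mean_true) ** 2 for t in y_true)\n",
    "    # assumes ss_tot > 0, i.e. the targets are not constant\n",
    "    return 1.0 - ss_res / ss_tot\n",
    "\n",
    "# toy values (illustrative only): one prediction is off by 1\n",
    "toy_targets = [1.0, 2.0, 3.0, 4.0]\n",
    "toy_preds = [1.0, 2.0, 3.0, 5.0]\n",
    "print('manual r2:', manual_r2(toy_targets, toy_preds))"
   ]
  }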
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}