*Francisca Dias*

The goal of this report is to **show how to deal with highly imbalanced data.**

For this purpose I will work on a dataset that contains transactions made by credit cards. There are 284,807 transactions, of which 492 are frauds. This dataset is therefore highly imbalanced: the positive class (frauds) accounts for only **0.172%** of all transactions.

Imbalanced data is common in classification problems: most of the time, classes are not represented equally.

Here I will show that excellent accuracy scores (close to 100%) do not reflect the true value of a model: they only reflect the underlying class distribution.

As we will see shortly, the reason we get close to 100% accuracy on imbalanced data is that our models look at the data and cleverly decide that the best thing to do is to always predict “Not Fraud”, thereby achieving high accuracy.
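To see this effect in isolation (a small synthetic sketch, not the credit card data itself), scikit-learn's `DummyClassifier` with the `most_frequent` strategy always predicts the majority class, and on labels with roughly the same 0.2% imbalance it already scores close to 100% accuracy while learning nothing:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels with roughly the same imbalance as the fraud data: ~0.2% positives
rng = np.random.RandomState(0)
y = (rng.rand(100_000) < 0.002).astype(int)
X = np.zeros((len(y), 1))  # features are irrelevant for this baseline

# Always predicts the majority class ("not fraud")
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, y)

acc = accuracy_score(y, baseline.predict(X))
print(f'Baseline accuracy: {acc:.4f}')  # close to 1 despite learning nothing
```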

To overcome this problem I will look at other performance metrics beyond "accuracy":

- Confusion Matrix

- Classification Report (precision, recall and f1 score) and

- Precision-Recall

Decision trees often perform well on imbalanced datasets, so I will use two tree-based classifiers: CART (via a bagging ensemble of decision trees) and Random Forest.

The dataset contains transactions made by credit cards in September 2013 by European cardholders. These transactions occurred over two days.

Here we have **492 frauds** out of **284,807 transactions:** the positive class (frauds) accounts for **0.172%** of all transactions.

It contains only numerical input variables, which are the result of a PCA transformation (PCA is a dimensionality reduction technique that transforms the dataset into a compressed form).

Due to confidentiality issues the original features are not provided.

Features V1, V2, ... V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are 'Time' and 'Amount'.

Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset.

The feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning.

Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.

This dataset can be found here.

In [1]:

```
import numpy as np
import pandas as pd
df = pd.read_csv('creditcard.csv')
df.head()
```

Out[1]:

In [2]:

```
df.shape
```

Out[2]:

We can see that all of the attributes are numeric (float).

In [3]:

```
df.dtypes
```

Out[3]:

We can see that the classes are imbalanced between 0 (not fraud) and 1 (fraud).

In [4]:

```
df.groupby('Class').size()
```

Out[4]:

The most commonly used measure of classifier performance is **accuracy**: the percent of correct classifications obtained.

Although accuracy is an easy metric to understand and to use when comparing different classifiers, it ignores factors that should be taken into account when honestly assessing the performance of a classifier.
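As a quick sanity check using the counts above, a model that always answers "not fraud" would already score:

```python
# Accuracy of a classifier that always predicts "not fraud",
# using the dataset's own counts
total, frauds = 284_807, 492
accuracy = (total - frauds) / total
print(f'{accuracy:.5f}')  # about 0.998
```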

In [5]:

```
from sklearn.model_selection import train_test_split
array = df.values
X = array[:,0:30]
y = array[:,30]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)
```

In [6]:

```
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier

# BaggingClassifier bags decision trees (CART) by default
model_cart = BaggingClassifier()
model_rf = RandomForestClassifier()

model_cart.fit(X_train, y_train)
model_rf.fit(X_train, y_train)

predicted_cart = model_cart.predict(X_test)
predicted_rf = model_rf.predict(X_test)

# score() returns accuracy for classifiers
print('CART score is:', model_cart.score(X_test, y_test))
print('Random Forest score is:', model_rf.score(X_test, y_test))
```

**0.999 in accuracy**. But this just mirrors the underlying class distribution; we need to look beyond this metric.

For this problem, imagine that we predicted a transaction was fraudulent and it turns out it wasn't - a **False Positive.** This is a mistake, but not a serious one.

BUT what if we predicted that a transaction was not fraudulent and it turns out to be fraud? - a **False Negative.** This is a big mistake with serious financial consequences.

We need a method which will take into account all of these numbers.
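For reference, scikit-learn's `confusion_matrix` returns the same counts as a crosstab of actual vs. predicted labels, in the layout `[[TN, FP], [FN, TP]]`. A small sketch with made-up labels (1 = fraud, 0 = not fraud):

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# ravel() flattens the 2x2 matrix into (TN, FP, FN, TP)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f'TN={tn} FP={fp} FN={fn} TP={tp}')
```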

In [7]:

```
confusion_cart = pd.crosstab(y_test, predicted_cart, rownames=['Actual'], colnames=['Predicted'], margins=True)
confusion_cart
```

Out[7]:

In [8]:

```
confusion_rf = pd.crosstab(y_test, predicted_rf, rownames=['Actual'], colnames=['Predicted'], margins=True)
confusion_rf
```

Out[8]:

We are trying to **minimize the False Negatives**:

Random Forest (FN = 20) seems to perform better than CART (FN = 21).

In [9]:

```
from sklearn.metrics import classification_report
# Cart
report_cart = classification_report(y_test, predicted_cart)
print(report_cart)
# Random Forest
report_rf = classification_report(y_test, predicted_rf)
print(report_rf)
```

In [15]:

```
from sklearn.metrics import average_precision_score

# Note: average precision is computed here on hard 0/1 predictions;
# feeding continuous scores (e.g. predict_proba) would trace the full curve
average_precision_Cart = average_precision_score(y_test, predicted_cart)
average_precision_RF = average_precision_score(y_test, predicted_rf)

print('Average precision-recall score for CART: {0:0.2f}'.format(average_precision_Cart))
print('Average precision-recall score for Random Forest: {0:0.2f}'.format(average_precision_RF))
```
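One caveat worth knowing (my observation, not part of the run above): `average_precision_score` is designed to take continuous scores, e.g. from `predict_proba`, rather than hard 0/1 predictions. With hard labels the precision-recall "curve" collapses to a single operating point. A sketch on toy data:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Toy labels and continuous scores (e.g. from predict_proba(...)[:, 1])
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])
hard = (scores >= 0.5).astype(int)  # thresholded 0/1 predictions

ap_full = average_precision_score(y_true, scores)  # uses the full ranking
ap_hard = average_precision_score(y_true, hard)    # only one threshold survives
print(ap_full, ap_hard)
```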

The official documentation can be found here.

Precision-Recall is a useful measure of success of prediction when the classes are very imbalanced. In information retrieval:

- **precision** is a measure of result relevancy, and

- **recall** is a measure of how many truly relevant results are returned.

The **precision-recall curve** shows the tradeoff between precision and recall for different thresholds. A high area under the curve represents both high recall and high precision, where high precision relates to a low false positive rate, and high recall relates to a low false negative rate. High scores for both show that the classifier is returning accurate results (high precision), as well as returning a majority of all positive results (high recall).

A system with high recall but low precision returns many results, but most of its predicted labels are incorrect when compared to the training labels. A system with high precision but low recall is just the opposite, returning very few results, but most of its predicted labels are correct when compared to the training labels. An ideal system with high precision and high recall will return many results, with all results labeled correctly.
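This tradeoff can be traced explicitly with scikit-learn's `precision_recall_curve`, which sweeps the decision threshold over the scores (a sketch on toy labels and scores):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and scores; each distinct score acts as a candidate threshold
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
scores = np.array([0.2, 0.9, 0.6, 0.3, 0.8, 0.1, 0.4, 0.55])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# By convention the curve ends at precision=1, recall=0
for p, r in zip(precision, recall):
    print(f'precision={p:.2f} recall={r:.2f}')
```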

Precision (P) is defined as the number of true positives (T_p) over the number of true positives plus the number of false positives (F_p): P = T_p / (T_p + F_p).

Recall (R) is defined as the number of true positives (T_p) over the number of true positives plus the number of false negatives (F_n): R = T_p / (T_p + F_n).
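Plugged into code with made-up counts (T_p = 8, F_p = 2, F_n = 4, purely illustrative), including the f1 score from the classification report, which is the harmonic mean of the two:

```python
tp, fp, fn = 8, 2, 4  # hypothetical counts for illustration

precision = tp / (tp + fp)  # 8 / 10
recall = tp / (tp + fn)     # 8 / 12
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(precision, recall, round(f1, 3))
```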
