Predicting Credit Card Fraud From Transaction Information

Takuma Fujiwara
May 8, 2021 · 14 min read

Authored by: Takuma Fujiwara, Adrian Cantu Garza, Raed Asdi, Nikhil Jalla, Samuel Rizzo, and Aly Ashraf

Introduction and Background

For our final project, we decided to train a model for the Kaggle competition IEEE-CIS Fraud Detection. The competition used a dataset of real-world e-commerce transactions provided by the Vesta Corporation, a leading payment service company. As the name of the competition suggests, the goal was to build a model that accurately predicts whether a credit card transaction is fraudulent, based on the plethora of features described in the following section.

About the Data

The data for this competition was split into two parts: the transaction table and the identity table. The transaction table keeps a detailed record of the information involved in a transaction between two parties. Features in this table include transaction time (TransactionDT), transaction amount (TransactionAmt), product code (ProductCD), information about the card used in the transaction (card1 through card6), address (addr), distance (dist), and the email domains of the purchaser and recipient (the P_ and R_ columns). The transaction table also includes features that the Vesta Corporation masked to protect client privacy; while these have no practical context for us, they still influence the predictions of our model.

The identity table keeps track of network information and digital signatures of users. Features in this table include information such as IP address, ISP, browser, OS, device type, and device info. Much like the masked features in the transaction table, almost all of the features in the identity table are masked to protect client privacy. The main purpose of the identity table is to help identify potentially fraudulent users: once a fraudulent transaction has been found, the user information associated with it can be flagged as carrying a higher risk of future fraudulent transactions.

EDA (Exploratory Data Analysis)

One detail about the data that stood out was the severe imbalance of fraud vs. non-fraud data points. In the training set, there were about 570,000 non-fraudulent transactions and only 20,000 fraudulent ones (roughly a 28:1 ratio). An imbalance like this can make a model look deceptively good when evaluated on accuracy: a model can exploit the imbalance by labeling every transaction as the majority class and still be right most of the time, which means it has simply overfit to that class. To combat this, we evaluated our models on AUC (area under the ROC curve) rather than accuracy.
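To make the accuracy trap concrete, here is a minimal sketch (with synthetic labels at roughly the same 28:1 ratio, not the real data) of how a classifier that always predicts "not fraud" scores under each metric:

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 1 / 29).astype(int)  # ~3.4% positive, like ours
y_pred = np.zeros_like(y_true)  # a "model" that never predicts fraud

print(accuracy_score(y_true, y_pred))  # ~0.97, misleadingly high
print(roc_auc_score(y_true, y_pred))   # 0.5, no better than random guessing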

This issue of class imbalance can be attributed to Vesta’s labeling of fraud. According to their discussion post,

“The logic of our labeling is define reported chargeback on the card as fraud transaction (isFraud=1) and transactions posterior to it with either user account, email address or billing address directly linked to these attributes as fraud too. If none of the above is reported and found beyond 120 days, then we define it as legit transaction (isFraud=0).”

With this definition, even Vesta admits that real-world credit card fraud can go unnoticed; however, since those transactions are never known to be fraudulent, they are ignored, and Vesta claims such cases are “unusual” and “negligible”.

There was also a lot of missing data: almost every column of the dataset had at least one missing value. This is to be expected, since not all data can be collected reliably in a setting like credit card fraud. We simply filled missing values with 0, because filling with another value such as the column mean or median was much too heavy on our resources.
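In pandas this is a one-liner. A minimal sketch, with a toy frame standing in for the merged table:

import numpy as np
import pandas as pd

# Toy stand-in for the merged transaction/identity table.
df = pd.DataFrame({'TransactionAmt': [57.95, np.nan, 29.0],
                   'card1': [13926, 2755, np.nan]})

# One pass over the whole frame; far cheaper than computing a
# per-column mean or median over ~600k rows and ~400 columns.
df = df.fillna(0)
print(df)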

Another interesting aspect of the data was that a user would typically commit either only fraudulent transactions or only non-fraudulent transactions; a very small percentage had a mix of both. This is illustrated in the pie chart below: only 0.2% of users have both fraudulent and non-fraudulent transactions in their history.

Preprocessing

Feature Selection

XGBoost to select features

The initial dataset consists of 429 features, a number that increases to 578 once we one-hot encode the categorical features. Given that our training dataset has around 600,000 data points, our models would have taken a very long time to train on all of this data, especially when doing cross-validation to tune the hyperparameters. Due to our time constraints and limited computation power, we decided it would be beneficial to get rid of the less important features.

One of the ways we achieved this was by first training an XGBoost model using all of the data. XGBoost automatically calculates the importance of each feature when the model is trained, so we could easily extract these scores through the feature_importances_ attribute of the fitted model.
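A minimal sketch of this extraction (the data here is random filler; only the feature_importances_ usage reflects what we did):

import numpy as np
import pandas as pd
from xgboost import XGBClassifier

# Random filler standing in for the training matrix and isFraud labels.
X = pd.DataFrame(np.random.rand(500, 10),
                 columns=[f'V{i}' for i in range(10)])
y = np.random.randint(0, 2, 500)

model = XGBClassifier(n_estimators=50)
model.fit(X, y)

# feature_importances_ is populated after fitting; pairing each score
# with its column name and sorting shows what matters most.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))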

[Plot of the importance of each feature, as calculated by XGBoost]

Once we had the importance of every feature expressed as a number between zero and one, the next step was to figure out the best threshold for deciding how important a feature needs to be to keep it. The key was to balance the useful information lost against the amount of data we could get rid of. In the end, we chose a threshold of 0.0001, which brought the number of features down to only 234 without losing any significant amount of performance. To measure the effect of the reduction, we trained two XGBoost models, one on the original and one on the reduced training data: the original model gave us an AUC score of 0.868, while the model trained after feature selection scored 0.866.
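One way to apply such a cutoff is scikit-learn's SelectFromModel; a sketch under the same random-filler setup as above:

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from xgboost import XGBClassifier

# Random filler standing in for the training matrix and labels.
X = pd.DataFrame(np.random.rand(500, 10),
                 columns=[f'V{i}' for i in range(10)])
y = np.random.randint(0, 2, 500)
model = XGBClassifier(n_estimators=50).fit(X, y)

# Drop every feature whose importance falls below the 0.0001 cutoff.
selector = SelectFromModel(model, threshold=0.0001, prefit=True)
X_reduced = selector.transform(X)
print(X.shape[1], '->', X_reduced.shape[1], 'features')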

Although feature selection slightly reduced the score, the smaller dataset ultimately let us increase it, because it made more in-depth hyperparameter tuning possible and let us test more models in a shorter amount of time.

Lasso Use

To confirm the features selected by XGBoost, we also ran a LassoCV model, since the Lasso performs automatic feature selection by shrinking uninformative coefficients to exactly zero. The conclusions drawn from this were strictly for extra validation.

First, we went through the categorical features, filling all null values with 0; as mentioned earlier, this was the most practical choice given our resources. Then we encoded the remaining categorical strings into integers using a LabelEncoder object, chosen mainly for its speed and because it converts the categorical features into the numerical values the model requires.
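A minimal sketch of that encoding step on a toy column (the column values are made up):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for one categorical column, with a null value.
df = pd.DataFrame({'ProductCD': ['W', 'C', 'W', 'H', None]})

# Fill nulls with 0 first, cast to str so every entry has one type,
# then map each category string to an integer code.
df['ProductCD'] = df['ProductCD'].fillna(0).astype(str)
df['ProductCD'] = LabelEncoder().fit_transform(df['ProductCD'])
print(df)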

After this was done, we did a train/test split for later scoring. We then fit the LassoCV model to the training data, trying several alpha values to find the best one; this code ran fairly quickly compared to our actual model training and cross-validation. We ended with an RMSE of about 0.16675.
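A sketch of this validation run, with random filler in place of the encoded dataset (the alpha list here is illustrative, not the exact one we used):

import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Random filler standing in for the encoded features and isFraud labels.
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, 1000)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# LassoCV cross-validates every alpha and keeps the best; coefficients
# shrunk to exactly zero mark the features it dropped.
lasso = LassoCV(alphas=[0.0001, 0.001, 0.01, 0.1], cv=5)
lasso.fit(X_train, y_train)

rmse = mean_squared_error(y_test, lasso.predict(X_test)) ** 0.5
print('alpha:', lasso.alpha_, 'RMSE:', round(rmse, 5))
print('features kept:', int(np.sum(lasso.coef_ != 0)))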

Now for the part we wanted to see: below is an image of the top 60 features that Lasso picked.

This chart shows the top 60 features picked by the Lasso model for consideration. As we can see, V189 and V240 are the most important features according to this model. We can then compare these selections to our XGBoost selections to see whether the two lists of feature importances are similar.

Overall, both models selected similar parameters; however, the XGBoost model seemed to eliminate a few of the ‘card’ variables. This could simply be due to differences in how the two algorithms score features.

After more research into the discussion posts on Kaggle, we found that others had figured out that the different card variables are dependent on one another, since they are just different attributes of the same credit card. With correlated features, it is normal for two selection methods to keep different representatives, so this difference in selection is fine.

Feature Engineering

We engineered new features using information about the data given by Vesta, the provider of the dataset. Observations made by exploratory data analysis were also used to engineer new features.

At the start of the project, we learned from exploratory data analysis that only a small percentage of the data is labeled isFraud=1. Additionally, we learned from discussion boards on Kaggle that a user either always does fraudulent transactions or always legitimate transactions, and it is very rare for users to have a mix of fraudulent and legitimate transactions associated with them.

This means that this competition wasn’t really about predicting whether a transaction was fraudulent, but rather about predicting whether a user was fraudulent.

With the above information in mind, here are the three main things we did to engineer new features:

  • Linking transactions to individual user identity
  • Creating timestamp columns out of time delta columns
  • One-hot encoding categorical features

The original data is split into two tables: identity and transaction. The identity file contains masked information about the user, while the transaction file contains information about the individual transaction. The two files were merged on the feature TransactionID. Though the identity file contains information about the user, all of the data is masked, and for privacy reasons there are no columns that directly identify an individual user. We therefore engineered a UID (unique ID), a feature that identifies an individual user, by combining existing features. The features used in the combination were chosen by the following method:

  1. Labeling data in the training set as inTraining=1, and data in testing set as inTraining=0.
  2. Mixing the two sets.
  3. Training a model that predicts inTraining by using all other features.
  4. Finding the features most useful for this prediction by looking at feature_importances_.

By training a model that identifies whether a transaction belongs to the training set or the testing set, we can find which features are the most important in identifying the specific user associated with a transaction.
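A minimal sketch of the four steps above (an “adversarial validation” style check), with toy frames in place of the real tables:

import numpy as np
import pandas as pd
from xgboost import XGBClassifier

# Toy stand-ins for the real train and test tables (same columns).
train = pd.DataFrame(np.random.rand(500, 5), columns=list('ABCDE'))
test = pd.DataFrame(np.random.rand(500, 5), columns=list('ABCDE'))

# Steps 1-2: label set membership, then mix the two sets together.
train['inTraining'] = 1
test['inTraining'] = 0
mixed = pd.concat([train, test], ignore_index=True).sample(frac=1)

# Step 3: train a model to predict which set each row came from.
X = mixed.drop(columns='inTraining')
model = XGBClassifier(n_estimators=50).fit(X, mixed['inTraining'])

# Step 4: the features that best separate the two sets.
print(pd.Series(model.feature_importances_, index=X.columns)
      .sort_values(ascending=False))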

Next, we created new columns from the D columns, 15 columns that contain time deltas, such as the number of days since a previous transaction. We created new columns containing a timestamp instead of a time delta by using the D columns and TransactionDT. From the discussion boards, we learned that TransactionDT is not an actual date but a time delta from a reference datetime, measured in seconds: its first value is 86400, which corresponds to the number of seconds in a day (60 * 60 * 24 = 86400). We took the difference between TransactionDT (converted to the unit of the D columns) and each D column to create a timestamp column.
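A sketch of that transformation for a single D column (the values are made up; the arithmetic is the point):

import pandas as pd

# Toy rows: TransactionDT in seconds, D1 as a day-valued time delta.
df = pd.DataFrame({'TransactionDT': [86400, 172800, 2592000],
                   'D1': [0.0, 1.0, 14.0]})

# Convert TransactionDT to days, then subtract the delta so repeated
# transactions from the same card share one fixed reference day.
df['D1_timestamp'] = df['TransactionDT'] / (60 * 60 * 24) - df['D1']
print(df)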

Lastly, using the information about the data provided by Vesta, we one-hot encoded the categorical features (a sketch follows the two lists below).

Categorical features in the transaction table:

  • ProductCD
  • card1 — card6
  • addr1, addr2
  • P_emaildomain
  • R_emaildomain
  • M1 — M9

Categorical features in the identity table:

  • DeviceType
  • DeviceInfo
  • id_12 — id_38
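A sketch of the encoding itself, on two of the columns listed above:

import pandas as pd

# Toy rows with two of the categorical columns named above.
df = pd.DataFrame({'ProductCD': ['W', 'C', 'H'],
                   'DeviceType': ['mobile', 'desktop', 'mobile']})

# get_dummies expands each categorical column into 0/1 indicator
# columns, one per category.
df = pd.get_dummies(df, columns=['ProductCD', 'DeviceType'])
print(df.columns.tolist())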

The feature engineering resulted in an improvement of our score by around 0.01.

Models Used

XGBoost

The first algorithm we decided to use was XGBoost, an open-source, regularized gradient boosting algorithm designed for machine learning applications. Its advantages include a regularization term that smooths the final weights to help avoid overfitting, and shrinkage, which scales each new tree's weights by a factor η (the learning rate) to reduce the influence of any individual tree.

Before tuning the hyperparameters, the default XGBoost model achieved an AUC score of 0.944.

A grid search with the following parameter grid was used to determine the best hyperparameters for our XGBoost model:

param_grid = {'max_depth': [1, 10, 50, 100],
              'n_estimators': [10, 50, 100, 200],
              'learning_rate': [0.25, 0.5, 0.75],
              'booster': ['gbtree', 'gblinear', 'dart']}

XGBoost took much longer to run than the other two models we tested, so to compensate, we reduced the size of the grid search and ran the cross-validation on all cores.
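A sketch of how such a search can be wired up with scikit-learn's GridSearchCV (random filler data; scoring='roc_auc' matches the competition metric, and n_jobs=-1 spreads the folds across all cores):

import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Random filler standing in for the reduced training data.
X = np.random.rand(300, 8)
y = np.random.randint(0, 2, 300)

param_grid = {'max_depth': [1, 10, 50, 100],
              'n_estimators': [10, 50, 100, 200],
              'learning_rate': [0.25, 0.5, 0.75],
              'booster': ['gbtree', 'gblinear', 'dart']}

search = GridSearchCV(XGBClassifier(), param_grid,
                      scoring='roc_auc', cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)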

The grid search revealed the best hyperparameters to be:

best_params = {'max_depth': 50,
               'n_estimators': 200,
               'learning_rate': 0.5,
               'booster': 'gbtree'}

Using our newfound hyperparameters, the fully tuned XGBoost model achieved an AUC score of 0.963.

LightGBM

Another algorithm we decided to use was LightGBM, a decision-tree-based gradient boosting method originally developed by Microsoft. We chose this model because it not only typically delivers great performance on this type of problem, but is also fast and scalable. Training a LightGBM model with our data took only around 10 seconds, which allowed us to try many different hyperparameters and thus end up with a very good model.

To get an initial idea of how this model was going to perform we trained a LightGBM model with default parameters. This model gave us an AUC score of 0.923.

To choose the best hyperparameters, we used cross-validation with the help of a grid search:

param_grid = {'min_data_in_leaf': [20, 30, 50, 100],
              'boosting_type': ['gbdt', 'gbrt', 'gbm'],
              'num_leaves': [30, 50, 75, 100, 150, 200],
              'learning_rate': [0.01, 0.05, 0.1, 0.15, 0.17],
              'feature_fraction': [0.5, 0.7, 0.9, 1.0]}

In addition to these parameters, we could also specify the number of boosting rounds the model runs during training. We found that around 100 rounds was optimal: any more and the model would start to overfit, any less and it would underfit the data.
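A sketch of training with an explicit round count through LightGBM's native API (random filler data; the params dict is abbreviated):

import lightgbm as lgb
import numpy as np

# Random filler standing in for the reduced training data.
X = np.random.rand(500, 8)
y = np.random.randint(0, 2, 500)

params = {'objective': 'binary', 'metric': 'auc',
          'num_leaves': 200, 'learning_rate': 0.15}

# num_boost_round is the round count discussed above; ~100 was the
# sweet spot between underfitting and overfitting for us.
model = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)
print(model.predict(X[:5]))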

Thanks to LightGBM's speed, finding the optimal parameters was relatively fast compared to the other models we tried; the grid search took around three hours to complete.

In the end, the best hyperparameters we found were the following:

params = {'min_data_in_leaf': 20,
          'boosting_type': 'gbdt',
          'num_leaves': 200,
          'learning_rate': 0.15,
          'feature_fraction': 0.9}

Retraining our model with these hyperparameters, we were able to get an AUC score of 0.9639.

CatBoost

CatBoost is a gradient boosting algorithm on decision trees that is known for a few distinctive qualities. First, it offers great performance even on default parameters, which can reduce the need for parameter tuning. Second, it offers GPU hardware acceleration for extremely fast learning. According to Yandex's benchmarks, CatBoost took only around 527 seconds of CPU training, whereas other algorithms like XGBoost and LightGBM took much longer (4,339 and 1,146 seconds respectively); with GPU acceleration it was even faster, taking only 18 seconds. This benchmark used a dataset with 400,000 samples and 2,000 features, with the model parameters set to 128 bins, 64 leaves, and 400 iterations.

CatBoost's design also allows it to incorporate and learn from heterogeneous data sources very well, as opposed to traditional deep-learning algorithms, which tend to learn better from homogeneous datasets. This makes CatBoost useful in cases like weather prediction, where multiple sources of data, such as historical observations and weather-model output, must be combined.

As such, we expected a good result from cross-validation. However, when running a grid search with the following parameter grid, we noticed an issue.

param_grid = {'depth': [3, 1, 2, 6, 4, 5, 7, 8, 9, 10],
              'iterations': [250, 100, 500, 1000],
              'learning_rate': [0.03, 0.001, 0.01, 0.1, 0.2, 0.3],
              'l2_leaf_reg': [3, 1, 5, 10, 100],
              'border_count': [32, 5, 10, 20, 50, 100, 200]}

The grid search was taking extremely long; even at the time of writing this blog, it is still running. Not having access to GPU acceleration, along with the high RAM cost of this dataset, may be why the grid search never finished. Thus, we moved to a random search instead, which took only about an hour or two, compared to the 25 hours (at the time of writing) that the grid search has been running.
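One way to run such a random search is scikit-learn's RandomizedSearchCV, which samples a fixed number of combinations from the grid instead of trying all of them (random filler data; the n_iter value is illustrative):

import numpy as np
from catboost import CatBoostClassifier
from sklearn.model_selection import RandomizedSearchCV

# Random filler standing in for the training data.
X = np.random.rand(300, 8)
y = np.random.randint(0, 2, 300)

param_grid = {'depth': [3, 1, 2, 6, 4, 5, 7, 8, 9, 10],
              'iterations': [250, 100, 500, 1000],
              'learning_rate': [0.03, 0.001, 0.01, 0.1, 0.2, 0.3],
              'l2_leaf_reg': [3, 1, 5, 10, 100],
              'border_count': [32, 5, 10, 20, 50, 100, 200]}

# n_iter caps how many of the 8,400 grid combinations are tried, which
# is what turns a multi-day search into an hour or two.
search = RandomizedSearchCV(CatBoostClassifier(verbose=0), param_grid,
                            n_iter=10, scoring='roc_auc', cv=3)
search.fit(X, y)
print(search.best_params_)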

Now to the parameters we tuned. The first was depth, a relatively simple parameter that stops a tree at a certain depth to avoid overfitting. It proves extremely useful on this dataset because of the vast class imbalance: by trimming the tree to a certain depth, we can somewhat avoid the model always selecting the most common class.

Next was the number of iterations: the number of boosting iterations, i.e. how many trees are built while driving down the error on the loss function. Every iteration tries a new split based on the data input. Usually, more iterations give better results; however, at a certain point the additional iterations become redundant.

Another hyperparameter was the learning rate, as is to be expected with a gradient boosting algorithm. The learning rate defines the size of the step the algorithm takes. CatBoost is convenient here because the learning rate is usually determined automatically from the data and its properties, but we still experimented to see which learning rate worked best.

The l2_leaf_reg parameter controls L2 regularization, which adds a penalty term to the loss being optimized:

Loss = Σ L(y_i, ŷ_i) + λ Σ w_j²

The second term is the regularization term, scaled by the λ in front; that λ is what we are adjusting. Setting it too high leads to underfitting, since large leaf weights are penalized too heavily. This parameter is useful because we do not want to overfit the imbalanced data that we have.

Lastly there was the border_count parameter. Usually, the higher this parameter, the better; however, since it also greatly affects training time, we wanted to find a value that works well without being the maximum, optimizing for good results in a shorter amount of time. In situations where quality matters most, setting it to the maximum of 254 is better.

The following parameters won out in the random search:

params = {'depth': 9,
          'iterations': 1000,
          'learning_rate': 0.2,
          'l2_leaf_reg': 5,
          'border_count': 32}

Stacking and Final Performance

Lastly, we used stacking, an ensembling method that combines different models into one optimal model. The models we stacked were XGBoost, LightGBM, and CatBoost, tuned in the earlier stages of the project. The predictions from the tuned models were used to train a LogisticRegression meta-model, which produces the final prediction. The final AUC score we got was 0.9685, the best score of all the previous models.
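A sketch of this ensemble using scikit-learn's StackingClassifier (random filler data, and default base models standing in for our tuned ones):

import numpy as np
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

# Random filler standing in for the training data.
X = np.random.rand(500, 8)
y = np.random.randint(0, 2, 500)

# Cross-validated predictions from each base model become the inputs
# of the LogisticRegression meta-model that makes the final call.
stack = StackingClassifier(
    estimators=[('xgb', XGBClassifier()),
                ('lgbm', LGBMClassifier()),
                ('cat', CatBoostClassifier(verbose=0))],
    final_estimator=LogisticRegression())
stack.fit(X, y)
print(stack.predict_proba(X[:5])[:, 1])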

Conclusion and Takeaways

Based on the models and methods used, our project was essentially a deep dive into building models for Kaggle competitions.

Our preprocessing consisted of feature selection and feature engineering. Feature selection was essential because the original dataset consists of over 400 columns and 1 million rows when the training and testing sets are combined, so without it our models would have taken an absurd amount of time to train. We determined feature importance using the importances calculated by XGBoost, and validated those selections with a Lasso regression. Feature engineering was equally essential, because we learned that although the competition asks whether a transaction is fraudulent, it is really about predicting whether a user is fraudulent. We generated more useful features by mapping transactions to a single user, creating timestamp columns out of the ‘time delta’ columns, and one-hot encoding the categorical data.

After preprocessing, we tuned hyperparameters for three different models: LightGBM, CatBoost, and XGBoost. Of the three, LightGBM performed the best, with an AUC ROC score of 0.9639. We then improved on this by stacking all three tuned models, producing a final AUC ROC score of 0.9685 on the training set. On the Kaggle competition, our stacked model obtained a final score of 0.931028. There is a large drop between the Kaggle score and the training-set score, which we believe is due to our model overfitting the training set. If we were to continue this project, our next step would be to find ways to reduce that overfitting, such as early stopping, more feature engineering, and further hyperparameter tuning.
