Fraud Detection using Online Incremental Learning vs Batch Learning

December 27, 2023

Imagine running a bank and suddenly spotting unusual credit card transactions. With luck, an alert employee catches them in time. That can work while you have only a few customers, but later there could be millions or even billions. In the past, banks handled this with hand-written rules, but fraudsters have evolved, and relying on static rules alone is risky. What is needed is a robust system that adapts swiftly to new fraud tactics. In Machine Learning, one such approach is Online Incremental Learning, a method that can detect these tricky transactions as they arrive.

This article provides a demonstration of Online Incremental Learning for Fraud Detection, showcasing its importance in staying ahead of evolving fraud patterns.

Online Incremental Learning

Traditional Machine Learning vs Online Machine Learning

In simple terms, Online Incremental Learning continuously updates the model with new knowledge as each new transaction arrives, much like how we humans learn new things every day. Picture a chef who learns to cook while adding ingredients to a simmering pot, constantly tasting and adjusting.

In contrast to traditional ML (offline or batch learning), where the model is trained on the entire dataset at once, Online Incremental Learning ingests data in real time, one customer transaction at a time.

Need for Online Incremental Learning

Why is it needed? The answer is simple: change is constant. In fields like fraud detection, patterns change rapidly. Traditional models must be retrained on the entire dataset to pick up new patterns, which is time-consuming and often impractical.

Online Incremental Learning steps in as a game-changer, learning from new data on the fly. The key idea is that the model can adapt to newer frauds without discarding what it learned from previous ones. This makes it possible to build a robust system on the go while detecting fraudulent transactions. Benefits and limitations are explored in the subsequent sections.

How does it work?

River is a Python library designed for Online Incremental machine learning, providing support for various machine learning tasks such as regression, classification, and unsupervised learning. Additionally, it is versatile enough to handle ad-hoc tasks like calculating online metrics and detecting concept drift.

Each tool within the library is capable of being updated with just one observation at a time, making it suitable for processing streaming data. Depending on your specific use case, this approach may offer greater convenience compared to utilizing a batch model.
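To make "updated with just one observation at a time" concrete, here is a minimal plain-Python sketch that does not use River. The `RunningMean` class and the sample amounts are made up for illustration. It shows that a statistic can be kept up to date per observation and still match the batch result.

```python
class RunningMean:
    """Mean that is updated one observation at a time (no history stored)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x: float) -> None:
        self.n += 1
        # Shift the current mean towards the new value by 1/n of the gap.
        self.mean += (x - self.mean) / self.n


amounts = [12.5, 40.0, 7.25, 980.0, 15.0]  # made-up transaction amounts

# Batch: needs the whole list in memory.
batch_mean = sum(amounts) / len(amounts)

# Online: sees one amount at a time and keeps only two numbers.
running = RunningMean()
for amount in amounts:
    running.update(amount)

print(batch_mean, running.mean)
```

Both approaches arrive at the same mean (up to floating-point rounding), but the online version never needs more than the current observation in memory.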

Let’s see how the process can be broken down into steps with an example.

Online Incremental Learning for Fraud Detection using RiverML

Install library

pip install river

Import Libraries

from river import metrics
from river import preprocessing
from river import datasets
from river import anomaly
from river import compose

Initialize Data Source

In production, use an event streaming platform like Kafka for high-throughput transaction data. This facilitates real-time data integration into online incremental learning. In this blog, we’ve simulated event streaming with a focused sample of 50,000 transactions.

streaming_dataset = datasets.CreditCard().take(50_000)
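If you want to experiment without the River dataset, a stream can be approximated with a plain Python generator. The sketch below is only an illustration: `transaction_stream`, its feature names, and the 1% fraud rate are assumptions, not part of River or of the CreditCard dataset.

```python
import itertools
import random


def transaction_stream(seed=42):
    """Yield (features, label) pairs forever, one transaction at a time."""
    rng = random.Random(seed)
    while True:
        is_fraud = rng.random() < 0.01  # assumed ~1% fraud rate, for illustration
        amount = rng.uniform(500, 5000) if is_fraud else rng.uniform(1, 300)
        x = {"amount": amount, "hour": rng.randrange(24)}
        yield x, int(is_fraud)


# Take a bounded sample from the unbounded stream.
sample = list(itertools.islice(transaction_stream(), 1_000))
print(len(sample), sample[0])
```

Each item is a `(dict, label)` pair, which is the same shape the training loop later in this article iterates over.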

Building a Model

Before diving into the model construction, it’s crucial to address a few aspects:

Data Pre-processing: In this step, traditional ML processes are adapted for Online Incremental Learning using River’s on-the-fly adaptable pre-processing functions.

Incremental/Online Learning: Continuous training of the model on data from the Data Stream is essential. For anomaly detection, we opt for HalfSpaceTrees in the River Library, similar to Isolation Forest but trainable incrementally for ongoing adaptability against new fraudulent activities.

Metrics: We use the ROCAUC metric from River’s metrics module to evaluate the incremental learning model’s performance. It is updated alongside the model in the streaming loop, which allows continuous monitoring over time.

The components mentioned above can be composed as demonstrated below:

model = compose.Pipeline(
    preprocessing.MinMaxScaler(),  # data pre-processing, learned on the fly
    anomaly.HalfSpaceTrees()       # online incremental ML model
)

auc = metrics.ROCAUC()
auc_plt = []  # ROCAUC history, kept for plotting
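To give an intuition for what "pre-processing learned on the fly" means, here is a rough plain-Python sketch of an incremental min-max scaler. It is not River's implementation. The `OnlineMinMaxScaler` class is hypothetical and only mirrors the `learn_one` / `transform_one` naming style.

```python
class OnlineMinMaxScaler:
    """Scale each feature to [0, 1] using the min and max seen so far."""

    def __init__(self):
        self.min = {}
        self.max = {}

    def learn_one(self, x: dict) -> None:
        # Widen the per-feature range if the new value falls outside it.
        for name, value in x.items():
            self.min[name] = min(value, self.min.get(name, value))
            self.max[name] = max(value, self.max.get(name, value))

    def transform_one(self, x: dict) -> dict:
        scaled = {}
        for name, value in x.items():
            lo, hi = self.min.get(name, value), self.max.get(name, value)
            span = hi - lo
            # With a single distinct value seen so far there is no range yet.
            scaled[name] = (value - lo) / span if span > 0 else 0.0
        return scaled


scaler = OnlineMinMaxScaler()
for x in [{"amount": 10.0}, {"amount": 110.0}, {"amount": 60.0}]:
    scaler.learn_one(x)
    print(scaler.transform_one(x))
```

Note that the scaled value of a transaction depends on what has been seen so far, so the same amount can map to different values early and late in the stream.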

The example above is deliberately basic. To improve it, you can rebalance the data with under- or over-sampling and tune the model’s settings, known as hyperparameters. Fraud data is typically heavily imbalanced, so these extra steps can make a big difference in how well the model performs. More generally, Online Incremental Learning can be customized to the characteristics of the data you’re working with.
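As a rough illustration of under-sampling on a stream, the plain-Python sketch below keeps every fraud and only a random fraction of legitimate transactions. The `undersample` helper and its `keep_majority` parameter are hypothetical names, and the approach assumes labels are available when the model is trained.

```python
import random


def undersample(stream, keep_majority=0.1, seed=0):
    """Yield every fraud (y == 1) but only a random fraction of the rest."""
    rng = random.Random(seed)
    for x, y in stream:
        if y == 1 or rng.random() < keep_majority:
            yield x, y


# Synthetic stream: 990 legitimate transactions and 10 frauds.
stream = [({"amount": float(i)}, int(i % 100 == 0)) for i in range(1_000)]
balanced = list(undersample(stream))

frauds = sum(y for _, y in balanced)
print(frauds, len(balanced))
```

Because the helper is a generator, it can sit between the data source and the training loop without buffering the stream.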

Model Prediction and Learning

The code simulates a Kafka-like data stream by looping over a streaming dataset. Within the loop, the model predicts anomaly scores with score_one and learns from each sample using learn_one. The ROCAUC metric is updated with true labels and predicted scores using the update method. This code exemplifies an online incremental learning approach in a streaming data setting.

for x, y in streaming_dataset:     # simulating a Kafka-like stream
    score = model.score_one(x)     # predict the anomaly score first
    model.learn_one(x)             # then learn from the sample (no label needed)
    auc.update(y, score)           # update the metric with the true label
    auc_plt.append(auc.get())      # track the ROCAUC for plotting
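The order in the loop matters: each transaction is scored before the model learns from it, so the metric always reflects performance on unseen data. The sketch below replays that score-then-learn pattern in plain Python on synthetic data. The `RunningZScore` detector and the `roc_auc` helper are toy stand-ins written for this illustration, not River's HalfSpaceTrees or ROCAUC.

```python
import math
import random


class RunningZScore:
    """Toy anomaly scorer: distance from the running mean in standard deviations."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations (Welford's algorithm)

    def score_one(self, value: float) -> float:
        if self.n < 2:
            return 0.0  # not enough data to judge yet
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(value - self.mean) / std if std > 0 else 0.0

    def learn_one(self, value: float) -> None:
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)


def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


rng = random.Random(0)
model = RunningZScore()
labels, scores = [], []

for _ in range(2_000):
    y = int(rng.random() < 0.02)  # synthetic label, ~2% fraud
    amount = rng.gauss(900, 50) if y else rng.gauss(50, 20)
    scores.append(model.score_one(amount))  # 1. score first
    model.learn_one(amount)                 # 2. then learn
    labels.append(y)

auc_value = roc_auc(labels, scores)
print(round(auc_value, 3))
```

Swapping the two steps would let the model see each transaction before being judged on it, which would make the metric look better than it really is.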

Analyzing Metrics

Improvement of the ROCAUC metric over time through online incremental learning.

The above plot visually demonstrates the improvement of the ROCAUC metric over time through online incremental learning. Starting at 0, the metric steadily increases, reaching a final value of 0.95. This visual evidence highlights the model’s effective adaptation and enhanced anomaly detection capabilities as it learns from incoming data.

Benefits and Limitations

Let’s now look at the benefits and limitations of Online Incremental Learning. It is important to understand both, because for some use cases the drawbacks may outweigh the advantages.

Benefits

Adaptability: It can keep up with changing data trends.
Efficiency: Saves computational resources as it doesn’t need the entire dataset for retraining. So, instead of loading the whole dataset into memory for processing and training, it can take one transaction at a time.
Real-Time Learning: Ideal for applications where immediate learning from new data is crucial.
Scalability: Ideal for large, continuously growing datasets.

Limitations

Data Quality: Online Incremental Learning is only as good as the incoming data.
Complexity: Balancing learning speed and model stability can be tricky.
Complexity in Error Correction: Mistakes made in earlier learning can propagate if not addressed.

Advanced Strategies

As a bonus, let’s discuss some advanced strategies that can help to deal with real-world problems.

Dealing with Concept Drift: Concept drift occurs when the statistical properties of the target variable change over time. River supports tools like ‘ADWIN’ and ‘PageHinkley’ for detecting concept drift and alerting the system to update the model for significant changes.
Online Feature Selection: Not all features are created equal, especially in a changing environment like financial transactions. Online feature selection can help identify the most relevant features in real time. River offers methods like SelectKBest, which continuously evaluates and selects the top-performing features.
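To show the idea behind drift detection, here is a minimal plain-Python sketch of a Page-Hinkley style test for an upward shift in a stream's mean. It is a simplified illustration, not River's `PageHinkley` class. The class name, the parameter names and values, and the synthetic stream are all assumptions.

```python
class PageHinkleySketch:
    """Minimal Page-Hinkley test for an upward shift in a stream's mean."""

    def __init__(self, delta=0.005, threshold=50.0, min_samples=30):
        self.delta = delta              # tolerated magnitude of change
        self.threshold = threshold      # alarm level
        self.min_samples = min_samples  # warm-up before alarms are allowed
        self.n = 0
        self.mean = 0.0
        self.cumulative = 0.0           # running sum of deviations from the mean
        self.minimum = 0.0              # smallest cumulative value seen so far

    def update(self, x: float) -> bool:
        """Consume one value and return True if drift is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cumulative += x - self.mean - self.delta
        self.minimum = min(self.minimum, self.cumulative)
        return (self.n >= self.min_samples
                and self.cumulative - self.minimum > self.threshold)


# Synthetic monitored signal (for example, an error rate) that jumps at index 200.
stream = [0.1] * 200 + [0.9] * 200
detector = PageHinkleySketch()
first_alarm = None
for i, value in enumerate(stream):
    if detector.update(value):
        first_alarm = i
        break

print(first_alarm)
```

The alarm fires some time after the jump rather than immediately, because the test waits until the accumulated deviation exceeds the threshold. In a real system the alarm would typically trigger a model reset or a faster learning phase.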

Conclusion

In conclusion, this article highlights the role of Online Incremental Learning in real-time fraud detection, where patterns evolve across large, growing datasets. Using River, a Python library designed for Online Incremental Learning, we built a model that learns continuously from streaming data. With the HalfSpaceTrees model and the ROCAUC metric, the analysis section showed a substantial improvement over time. While Online Incremental Learning offers benefits such as adaptability, efficiency, and scalability, it requires careful attention to data quality and to the complexity of maintaining model stability. We also discussed advanced strategies for handling concept drift and online feature selection. Overall, Online Incremental Learning is a valuable tool for real-time fraud detection, helping anomaly detection systems stay adaptable and effective in dynamic environments.


Authors

Suman Michael

Michael, Technical Director for R&D at HexaCluster, with a focus on machine learning (ML), deep learning (DL), and generative AI (GenAI), brings a wealth of expertise to the table. With a mastery of languages such as C, Go, Rust, Java, Python, and JavaScript, he excels in crafting robust, data-intensive, and concurrent systems. Michael’s proficiency extends to PostgreSQL development and administration, showcasing his well-rounded technical prowess. A devoted advocate of open source, he remains actively engaged in contributing to its community, further enriching the collaborative landscape of technology.

Shubham Barthwal

Shubham is a distinguished Machine Learning Developer with a Master’s degree in Artificial Intelligence. He has expertise in Computer Vision, Text Classification, Deep Learning, and remarkable problem-solving skills. Throughout his career, he has been instrumental in developing and implementing innovative AI solutions, particularly excelling in optimizing complex processes and architecting sophisticated systems. His passion is deeply rooted in delivering robust, real-world solutions while continually honing his skills and knowledge in the ever-evolving landscape of AI and Machine Learning.