What You Will Learn
- How to handcraft a credit card fraud detection model.
- How to decentralize the model and run it with the FLock Client.
Technology Stack
- Python ^3.10
- Docker@latest
Prerequisites
- CUDA + cuDNN (latest version)
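If you plan to train on GPU, you can verify that PyTorch sees your CUDA device once the dependencies below are installed (a small sanity-check snippet, not part of the original tutorial):

import torch

# Prints True if CUDA/cuDNN are correctly installed and a GPU is visible
print(torch.cuda.is_available())

The tutorial's code automatically falls back to CPU if this returns False.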
Setup
- Create a project directory named credit_card_fraud_detection and enter it:
mkdir credit_card_fraud_detection
cd credit_card_fraud_detection
- Create a dependency list file named requirements.txt inside the credit_card_fraud_detection directory.
- Add the following to requirements.txt:
--find-links https://download.pytorch.org/whl/torch_stable.html
# CUDA 4.76GB
# torch==2.0.1+cu117; sys_platform == 'linux'
# torchvision==0.15.2+cu117; sys_platform == 'linux'
# CPU 1.85GB
# torch==2.0.1+cpu; sys_platform == 'linux'
# torchvision==0.15.2+cpu; sys_platform == 'linux'
torch==2.0.1
torchvision==0.15.2
pandas
numpy
pinatapy-vourhey
python-dotenv
requests
flock-sdk
- Run pip install -r requirements.txt from your terminal in the project directory to install all required packages.
Directory Structure
credit_card_fraud_detection
┣ models
┃ ┗ CreditFraudNetMLP.py
┣ data
┃ ┗ creditCard.json
┣ .env
┣ Dockerfile
┣ flockCreditCardModel.py
┣ pinataApi.py
┣ requirements.txt
┗ uploadImage.sh
This is an overview of the file structure:
- models folder: Stores our model definition.
- data folder: Contains data for training and testing.
- .env: Environment variable file.
- Dockerfile: Text document to build Docker images.
- flockCreditCardModel.py: Main file for the FLock model.
- pinataApi.py: Python script to upload to Pinata → IPFS.
- requirements.txt: Package requirements for running the model.
- uploadImage.sh: Bash script for running the upload.
Step-by-Step Guide
We start by building the model for credit card fraud detection. Its main purpose is to detect similarity between a user's transaction records and abnormal transaction behavior.
CreditFraudNetMLP
This class is a specific implementation of a neural network model using PyTorch. It defines the architecture and flow of data through the network for predictions.
- Architecture: Defines layers and operations transforming input data into output predictions.
- Purpose: Primarily focuses on mathematical transformations performed on the input data.
Dataset
When training a model, three main aspects are crucial: data, computation, and methodology. In this tutorial, the focus will mainly be on data and methodology. Please download the dataset here.
Data schema of a credit card fraud dataset looks like this:
{
"Time": 17187.0,
"V1": 1.0883749383,
"V2": 0.8984740237,
"V3": 0.3946843291,
"V4": 3.1702575745,
"V5": 0.1757387969,
// ... (other properties)
"V28": 0.0542542721,
"Amount": 3.79,
"Class": 1
}
You might be confused by the numerous properties/elements. Let’s dive deeper into the dataset:
- Time: A continuous variable representing seconds elapsed between this transaction and the first in the dataset.
- V1 to V28: Likely numerical variables resulting from a PCA transformation, often used to anonymize sensitive information.
- Amount: Represents the transaction amount.
- Class: A binary variable where:
- 1 indicates a fraudulent transaction.
- 0 indicates a non-fraudulent transaction.
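Once downloaded, a quick inspection helps confirm the schema and the class imbalance. A minimal sketch, assuming the file is saved as data/creditCard.json per the directory structure above:

import json
from pandas import DataFrame

with open("data/creditCard.json", "r") as f:
    records = json.load(f)

df = DataFrame.from_records(records)
print(df.shape)                    # rows x 31 columns: Time, V1..V28, Amount, Class
print(df["Class"].value_counts())  # fraudulent rows (Class == 1) are typically a small minority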
Let’s create CreditFraudNetMLP.py under the models directory inside credit_card_fraud_detection. This file will serve as our model.
Preliminaries
To begin, we need to import the required packages.
from torch import nn
Here’s a brief introduction to all the packages, in case you’re unfamiliar with their purposes:
- torch: PyTorch is a Python package that offers two high-level features:
- Tensor computation (similar to NumPy) with robust GPU acceleration.
- Deep neural networks built on a tape-based autograd system.
To create a model, the first step is defining the method we'll use. In this case, we'll employ an MLP (multi-layer perceptron) model to detect and learn from the data.
Constructor
class CreditFraudNetMLP(nn.Module):
    def __init__(self, num_features, num_classes):
        super(CreditFraudNetMLP, self).__init__()
The __init__ method serves as the constructor of the class, invoked upon instantiating an object of this class. It requires three arguments: self, num_features, and num_classes. The super function is utilized to invoke the same method from the parent class (nn.Module).
Layers
self.fc1 = nn.Sequential(nn.Linear(num_features, 64), nn.ReLU(), nn.Dropout(0.2))
self.fc2 = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Dropout(0.5))
self.fc3 = nn.Sequential(nn.Linear(128, num_classes), nn.Sigmoid())
self.fc1 is defined as a sequential module, serving as a container for layers executed sequentially. It comprises:
- Linear Layer (nn.Linear(num_features, 64)): Creates a fully connected (linear) layer with num_features input features and 64 output features (neurons).
- ReLU Activation (nn.ReLU()): Applies the Rectified Linear Unit (ReLU) activation function, turning the linear output into a non-linear one.
- Dropout (nn.Dropout(0.2)): Applies dropout with a probability of 0.2 for regularization to prevent overfitting.
self.fc2 is another sequential module that contains:
- Linear Layer (nn.Linear(64, 128)): A linear layer with 64 input features and 128 output features.
- ReLU Activation: Applies the ReLU activation function.
- Dropout (nn.Dropout(0.5)): Applies dropout with a higher probability of 0.5.
self.fc3 represents the output layer sequential module, featuring:
- Linear Layer (nn.Linear(128, num_classes)): A linear layer with 128 input features and num_classes output features.
- Sigmoid Activation (nn.Sigmoid()): Applies the sigmoid activation function to constrain the output within the range [0, 1], suitable for binary classification.
The subsequent step involves defining a forward method. This function primarily determines how data $x$ flows through the network during training, facilitating the forward pass, where data moves from one layer to another.
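The forward method itself isn't reproduced in this section, so here is a minimal sketch consistent with the three blocks defined above (an assumption about the original file, not a verbatim copy):

def forward(self, x):
    # Forward pass: data flows through each sequential block in turn
    x = self.fc1(x)     # (batch, num_features) -> (batch, 64)
    x = self.fc2(x)     # (batch, 64) -> (batch, 128)
    return self.fc3(x)  # (batch, 128) -> (batch, num_classes), sigmoid-squashed to [0, 1]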
Example Torch Model
Let’s create the flockCreditCardModel.py file in the project root and insert the following.
import json
from torch.utils.data import DataLoader, TensorDataset
import io
import torch
from pandas import DataFrame
from flock_sdk import FlockModel
from models.CreditFraudNetMLP import CreditFraudNetMLP
Initialisation Function Definition
class flockCreditCardModel(FlockModel):
def __init__(
self,
features,
epochs=1,
lr=0.03,
):
"""
Hyper parameters
"""
self.epochs = epochs
self.features = features
self.lr = lr
"""
Device setting
"""
if torch.cuda.is_available():
device = "cuda"
else:
device = "cpu"
self.device = torch.device(device)
Here we are doing two things: first, defining the hyperparameters; second, checking whether the user's device has a GPU available.
- features: The number of input features for the neural network.
- epochs: The number of epochs for training the model.
- lr: The learning rate for the optimizer during training.
Data handling
def init_dataset(self, dataset_path: str) -> None:
self.dataset_path = dataset_path
with open(dataset_path, "r") as f:
dataset = json.load(f)
dataset_df = DataFrame.from_records(dataset)
batch_size = 128
First, we need to load the data and convert it. Here, we open and read the dataset, then turn it into a DataFrame. As discussed earlier, pandas facilitates easier data manipulation and provides numerous useful functions for the subsequent operations.
Preparing Data for the Model
X_df = dataset_df.iloc[:, :-1]
y_df = dataset_df.iloc[:, -1]
X_tensor = torch.tensor(X_df.values, dtype=torch.float32)
y_tensor = torch.tensor(y_df.values, dtype=torch.float32)
y_tensor = y_tensor.unsqueeze(1)
dataset_in_dataset = TensorDataset(X_tensor, y_tensor)
Next, we need to convert our DataFrame into PyTorch tensors. Here, we split the data into two sets: features (X) and target (y).
- Features: Variables or columns in the dataset that provide information for the model to learn patterns and make predictions. Examples include transaction amount and time of transaction.
- Target: Variable (y), the ground truth that the model aims to predict. In the context of fraud detection, it is a binary variable indicating whether a transaction is fraudulent.
Let’s proceed with the code. We split the data into X_df and y_df, where X_df selects all columns except the last one, and y_df selects the last column. Then, we use the built-in function torch.tensor to convert the data into PyTorch tensors. Finally, we create a TensorDataset from these two tensors for X and y.
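One detail worth noting is y_tensor.unsqueeze(1): it adds a dimension so the targets have shape (N, 1), matching the model's (batch, 1) output. A small illustration (not part of the tutorial code):

import torch

y = torch.tensor([0.0, 1.0, 0.0])
print(y.shape)               # torch.Size([3])
print(y.unsqueeze(1).shape)  # torch.Size([3, 1])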
Setting Up Data Loaders
To streamline the process of feeding data into our model for both training and evaluation, we utilize PyTorch’s DataLoader. This tool allows for efficient data handling by batching, shuffling, and preparing the data for the model, ensuring optimal performance during the training process.
self.train_data_loader = DataLoader(
dataset_in_dataset,
batch_size=batch_size,
shuffle=True,
drop_last=False,
)
self.test_data_loader = DataLoader(
dataset_in_dataset,
batch_size=batch_size,
shuffle=True,
drop_last=False,
)
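Note that both loaders here are backed by the same dataset_in_dataset. If you wanted genuinely disjoint train and test splits (something the original code does not do), torch.utils.data.random_split is one standard option:

from torch.utils.data import random_split

# Hypothetical 80/20 split; the tutorial itself reuses one dataset for both loaders
train_size = int(0.8 * len(dataset_in_dataset))
train_set, test_set = random_split(
    dataset_in_dataset, [train_size, len(dataset_in_dataset) - train_size]
)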
Training
def train(self, parameters) -> bytes:
model = CreditFraudNetMLP(num_features=self.features, num_classes=1)
if parameters is not None:
model.load_state_dict(torch.load(io.BytesIO(parameters)))
model.train()
optimizer = torch.optim.SGD(
model.parameters(),
lr=self.lr,
)
criterion = torch.nn.BCELoss()
model.to(self.device)
Epochs and Batches
Next, we move on to the training stage. First, we prepare the model by instantiating CreditFraudNetMLP. Then, we check whether any previously trained parameters were provided; if so, we load them into the model before training.
for epoch in range(self.epochs):
train_loss = 0.0
train_correct = 0
train_total = 0
for inputs, targets in self.train_data_loader:
optimizer.zero_grad()
inputs, targets = inputs.to(self.device), targets.to(self.device)
outputs = model(inputs)
loss = criterion(outputs, targets)
loss.backward()
optimizer.step()
train_loss += loss.item() * inputs.size(0)
predicted = torch.round(outputs).squeeze()
train_total += targets.size(0)
train_correct += (predicted == targets.squeeze()).sum().item()
print(
f"Training Epoch: {epoch}, Acc: {round(100.0 * train_correct / train_total, 2)}, Loss: {round(train_loss / train_total, 4)}"
)
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
return buffer.getvalue()
Next, we need to define the loop for training. The training process adjusts the model’s internal parameters to minimize the discrepancy between its predictions and the actual outcomes. This iterative process progressively improves the model’s predictive accuracy.
- Epoch Loop
- Concept: An epoch represents one complete pass through the entire training dataset.
- Importance: Multiple epochs allow the model to sufficiently learn underlying patterns in the data.
- Note: Too many epochs might lead to overfitting, reducing the model’s ability to generalize.
- Batch Loop
- Batch Training: Mini-batch training updates parameters after a specified number of samples (a batch).
- DataLoader: The train_data_loader provides batches of data for efficient training and data shuffling.
- Note: Batch size is a crucial hyperparameter affecting training speed and stability.
The training process meticulously adjusts the model’s internal weights to minimize loss, enhancing its ability to predict fraudulent transactions accurately. Through this methodical approach, we ensure that the model can learn effectively from the data provided, setting the stage for robust fraud detection capabilities.
Evaluation
def evaluate(self, parameters: bytes) -> float:
criterion = torch.nn.BCELoss()
model = CreditFraudNetMLP(num_features=self.features, num_classes=1)
if parameters is not None:
model.load_state_dict(torch.load(io.BytesIO(parameters)))
model.to(self.device)
model.eval()
test_correct = 0
test_loss = 0.0
test_total = 0
The evaluation code resembles the training code but uses self.test_data_loader instead of self.train_data_loader. During evaluation, no model parameter updates occur; instead, the purpose is to assess performance on the test set.
Calculating Loss and Accuracy
with torch.no_grad():
for inputs, targets in self.test_data_loader:
inputs, targets = inputs.to(self.device), targets.to(self.device)
outputs = model(inputs)
loss = criterion(outputs, targets)
The primary evaluation metrics include loss rate and accuracy rate. This function returns the final accuracy calculated using these metrics.
Calculating and Returning Evaluation Metrics
Lastly, we aggregate the results from our evaluation to calculate the total loss and accuracy. This involves summing up the losses, rounding the model’s outputs to determine predictions, and comparing these predictions against the actual targets to count the number of correct predictions.
test_loss += loss.item() * inputs.size(0) # Calculating Cumulative Loss
predicted = torch.round(outputs).squeeze() # Rounding Model Outputs
test_total += targets.size(0) # Tracking Total Samples
test_correct += (predicted == targets.squeeze()).sum().item() # Calculating Correct Predictions
By completing these calculations, we can return the final accuracy, providing a clear metric to gauge the effectiveness of our model in detecting credit card fraud.
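The closing lines of evaluate are not shown above. Given the accumulators and the -> float signature, the final step is presumably along these lines (a sketch, not verbatim from the original):

# After the loop: compute the overall accuracy and return it
accuracy = 100.0 * test_correct / test_total
print(f"Evaluation Acc: {round(accuracy, 2)}, Loss: {round(test_loss / test_total, 4)}")
return accuracy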
Aggregation
The aggregation step calculates the average of the model parameters collected from the selected participants and returns the averaged model weights/gradients to all participants. This process, central to federated learning, involves training model parameters across various participants (e.g., devices and nodes) and aggregating them at a central location.
def aggregate(self, parameters_list: list[bytes]) -> bytes:
Load and Initialize
First, we load the selected participants' model parameters and initialize the template:
parameters_list = [
torch.load(io.BytesIO(parameters)) for parameters in parameters_list
]
averaged_params_template = parameters_list[0]
Then, we average the model parameters (see the formula after this list):
- Iterating through each parameter k in the template and gathering the associated parameter values from all sets in parameters_list.
- Calculating the average of these values and updating the template's parameter k.
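Concretely, for $n$ participants, each parameter tensor $k$ in the aggregate is the element-wise mean (the standard FedAvg-style average, matching the loop below):

$\bar{w}_k = \frac{1}{n} \sum_{i=1}^{n} w_k^{(i)}$

where $w_k^{(i)}$ is parameter $k$ from participant $i$.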
Compute Averages
for k in averaged_params_template.keys():
temp_w = []
for local_w in parameters_list:
temp_w.append(local_w[k])
averaged_params_template[k] = sum(temp_w) / len(temp_w)
- Iteratively accessing each parameter in the template.
- Gathering the corresponding parameter values from all models in the parameters_list.
- Calculating the mean of these values.
- Updating the template parameter with its new averaged value.
Serialize Averaged Parameters
Lastly, creating a buffer to store the aggregated parameters in byte format.
buffer = io.BytesIO()
# Saving state dict to the buffer
torch.save(averaged_params_template, buffer)
# Getting the byte representation
aggregated_parameters = buffer.getvalue()
return aggregated_parameters
Call the model
from flock_sdk import FlockSDK
from flockCreditCardModel import flockCreditCardModel
if __name__ == "__main__":
epochs = 1
lr = 0.000001
features = 30
model = flockCreditCardModel(features, epochs=epochs, lr=lr)
sdk = FlockSDK(model)
sdk.run()
Great! If all goes well, you've completed the first part of the tutorial. Here's the entrance to part 2, where we'll work on how this model can be used with the FLock Client: link here.
Reach out to us:
Website: https://flock.io/
Twitter: https://twitter.com/flock_io
Telegram: https://t.me/flock_io_community
Discord: https://discord.gg/ay8MnJCg2W