What You Will Learn
- Craft a sentiment analysis model for Web3 information
- Model training flow
- Comprehensive guide
Technology Stack
- Python ^3.10
- Docker@latest
Setup
- Create a new directory called sentiment-analysis.
- Create a requirements.txt file.
- Within the requirements.txt, add the following:
flock-sdk==0.01

# For non ARM architecture
# --find-links https://download.pytorch.org/whl/torch_stable.html

# CUDA 4.76GB
# torch==2.0.1+cu117; sys_platform == 'linux'
# torchvision==0.15.2+cu117; sys_platform == 'linux'

# CPU 1.85GB - ARM
torch==2.0.1
torchvision==0.15.2

# CPU 1.85GB - linux
# torch==2.0.1+cpu; sys_platform == 'linux'
# torchvision==0.15.2+cpu; sys_platform == 'linux'

# CPU 1.85GB - x86_64
# torch==2.0.1+cpu; sys_platform == 'x86_64'
# torchvision==0.15.2+cpu; sys_platform == 'x86_64'

# For developer
pandas
scikit-learn
tqdm
numpy
# lightning
pinatapy-vourhey
python-dotenv
requests
- Run pip install -r requirements.txt to ensure all required packages are installed.
How it Works
Understanding the general flow is our starting point. For sentiment analysis, the primary objective is to determine the emotional tone or sentiment of a piece of text, utilizing natural language processing techniques.
File Structure
sentiment-analysis
┣ model
┃ ┗ CNNModel.py
┣ .env
┣ Dockerfile
┣ FLockSentimentModel.py
┣ dataProcessing.py
┣ dataset.json
┣ pinata_api.py
┣ requirements.txt
┗ upload_image.sh
Model
Before delving into the CNN code, let’s first go over the structure and data flow of the CNN. In this example, we’ll use only four layers: one embedding layer, two convolutional layers, and a final output layer built with nn.Sequential (referred to here as the Sequential layer).
- Embedding Layer: Converts discrete, categorical inputs (such as word indices) into dense, continuous vectors.
- Convolutional Layers: Slide filters over the sequence of embedding vectors so the model can pick up local patterns, such as short phrases.
- Sequential Layer: Produces a number between 0 and 1, interpreted as the probability that the input belongs to the positive class in a binary classification problem.
Data Flow Chart:
Input data → Embedding Layer → Convolutional Layer 1 → Convolutional Layer 2 → Sequential Layer → Output
Let’s begin by importing the necessary packages:
from torch import nn
import torch.nn.functional as F
Now, let’s start coding the classifier with a basic structure:
class CNNClassifier(nn.Module):
    def __init__(self, vocab_size, emb_size):
        super().__init__()
        # Code goes here

    def forward(self, x):
        # Code goes here
        pass
Understanding three key variables:
- nn.Module: PyTorch's base class for all neural network modules; our classifier inherits from it.
- vocab_size: Size of the vocabulary the model deals with.
- emb_size: Size of the vectors into which words are mapped by the embedding layer.
Let’s define the layers:
- Embedding:
self.embedding = nn.Embedding(vocab_size, emb_size)
- Convolutional Layers:
self.conv1 = nn.Conv1d(emb_size, 100, 5, padding=2)
self.conv2 = nn.Conv1d(100, 100, 5, padding=2)
- Each convolutional layer is followed by a Rectified Linear Unit (ReLU) activation, available as the nn.ReLU() module or the F.relu function.
- Sequential layer:
self.fc3 = nn.Sequential(nn.Linear(100, 1), nn.Sigmoid())
nn.Linear(100, 1): A linear transformation of the incoming data, where the input feature size is 100 and the output size is 1.
nn.Sigmoid(): Applies the sigmoid function to squash the output values between 0 and 1.
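To see why the final layer always yields a value between 0 and 1, here is a minimal, framework-free sketch of the sigmoid function. The helper name `sigmoid` and the sample scores are illustrative assumptions, not part of the tutorial's code.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical raw scores, standing in for the output of nn.Linear(100, 1)
scores = [-4.0, 0.0, 4.0]
probs = [sigmoid(z) for z in scores]
print([round(p, 3) for p in probs])  # [0.018, 0.5, 0.982]
```

Large negative scores map close to 0, large positive scores close to 1, and a score of 0 maps to exactly 0.5.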
Forward Passing
The forward pass calculates and stores intermediate variables from the input layer to the output layer in neural networks. Let’s craft this function:
We start with embedding:
embs = self.embedding(x)
Next, apply the two convolutional layers with ReLU activations, then average the result over the sequence dimension:
h = F.relu(self.conv1(embs.transpose(1, 2)))
h = F.relu(self.conv2(h))
h_size = h.size(dim=2)
h = F.avg_pool1d(h, h_size).squeeze(dim=2)
Lastly, map the pooled features to a probability:
logits = self.fc3(h)
return logits
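Putting the pieces above together, the classifier might look like the sketch below. The layer definitions and forward pass follow the snippets in this section; the shape comments, the vocabulary size of 50, the embedding size of 16, and the random input are assumptions added only to show how tensor shapes change along the way.

```python
import torch
from torch import nn
import torch.nn.functional as F


class CNNClassifier(nn.Module):
    def __init__(self, vocab_size, emb_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.conv1 = nn.Conv1d(emb_size, 100, 5, padding=2)
        self.conv2 = nn.Conv1d(100, 100, 5, padding=2)
        self.fc3 = nn.Sequential(nn.Linear(100, 1), nn.Sigmoid())

    def forward(self, x):
        embs = self.embedding(x)                      # (batch, seq_len, emb_size)
        # Conv1d expects channels first, hence the transpose
        h = F.relu(self.conv1(embs.transpose(1, 2)))  # (batch, 100, seq_len)
        h = F.relu(self.conv2(h))                     # (batch, 100, seq_len)
        # Average over the whole sequence, leaving one value per channel
        h = F.avg_pool1d(h, h.size(dim=2)).squeeze(dim=2)  # (batch, 100)
        return self.fc3(h)                            # (batch, 1)


# Illustrative sizes only; the tutorial does not fix these values here
model = CNNClassifier(vocab_size=50, emb_size=16)
x = torch.randint(0, 50, (4, 64))  # 4 sequences of 64 token indices
probs = model(x)
print(probs.shape)  # torch.Size([4, 1])
```

With a kernel size of 5 and padding of 2, each convolution keeps the sequence length unchanged, so the pooling step is what collapses the sequence into a single feature vector per sample.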
Data processing
Before training, we need to preprocess the dataset. Data processing plays a fundamental role in neural network training, particularly in natural language processing. Typically, it involves four steps:
- Text Cleaning — Regular expressions
- Tokenization — Splitting into words
- Numerical Representation — Word indexing
- Handling sequence length — Padding & Truncation
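The four steps above can be sketched on a single sentence as follows. This is a simplified, standalone illustration: the function names `clean` and `encode` and the toy vocabulary are hypothetical.

```python
import re


def clean(text: str) -> str:
    # Step 1: put a space before punctuation, then blank out unsupported characters
    text = re.sub(r"([.,!?'])", r" \1", text)
    return re.sub(r"[^a-zA-Z0-9.,!?']", " ", text)


def encode(text: str, vocab: list, max_seq_len: int) -> list:
    tokens = clean(text).split()  # Step 2: tokenization on whitespace
    unk = vocab.index("[UNK]")
    # Step 3: map each token to its index, falling back to [UNK]
    ids = [vocab.index(t) if t in vocab else unk for t in tokens]
    ids = ids[:max_seq_len]  # Step 4a: truncation
    # Step 4b: padding up to max_seq_len
    return ids + [vocab.index("[PAD]")] * (max_seq_len - len(ids))


# Hypothetical toy vocabulary
vocab = ["[PAD]", "[UNK]", "gas", "fees", "are", "high", "!"]
print(encode("ETH gas fees are high!", vocab, max_seq_len=8))  # [1, 2, 3, 4, 5, 6, 0, 0]
print(encode("ETH gas fees are high!", vocab, max_seq_len=3))  # [1, 2, 3]
```

"ETH" is not in the toy vocabulary, so it maps to the [UNK] index, and the short sequence is padded with the [PAD] index.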
Let’s start with initialisation. We’ll create an __init__ function for our initialisation process. Several parameters need to be passed into this function:
- dataset: The input dataset.
- vocab: An optional pre-existing vocabulary. The default will be None, and if not provided, we'll generate it from the dataset.
- max_seq_len: The maximum sequence length.
- device: The device for computations. In this case, we want to use cuda.
- max_samples_count: The maximum number of samples to consider from the input dataset. This parameter acts as a way to limit the dataset size.
- max_vocab_size: The maximum number of words in the vocabulary.
Now, let’s put it all together. The start of __init__ looks like this:
def __init__(self, dataset, vocab=None, max_seq_len=64, device="cuda",
             max_samples_count=20000, max_vocab_size=30000):
    self.samples = []
    self.labels = []
    self.max_seq_len = max_seq_len
    self.device = device
Parsing through the dataset (assuming it’s in .csv format):
for row in dataset:
    label = row[0]
    sample = row[1]
    # Now we need to tokenize the samples (requires `import re`)
    # 1. Add a space before punctuation, which helps tokenization
    sample = re.sub(r"([.,!?'])", r" \1", sample)
    # 2. Retain only alphanumeric characters, specific punctuation marks,
    #    and apostrophes, replacing all other characters with a space
    sample = re.sub(r"[^a-zA-Z0-9.,!?']", " ", sample)
    self.labels.append(int(label) - 1)
    self.samples.append(sample)
    # Once we reach the max samples count, stop the loop
    if len(self.samples) >= max_samples_count:
        break
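Here is a small check of what the two substitutions do to a made-up sample. It also compares the backreference written as `\1` with the doubled form `\\1`: in a raw string, only the single-backslash form re-inserts the matched punctuation mark.

```python
import re

sample = "gm! $ETH to the moon, isn't it?"  # made-up input

good = re.sub(r"([.,!?'])", r" \1", sample)  # backreference to group 1
bad = re.sub(r"([.,!?'])", r" \\1", sample)  # literal backslash followed by "1"

print(good)  # gm ! $ETH to the moon , isn 't it ?
print(bad)   # gm \1 $ETH to the moon \1 isn \1t it \1

# Second step: anything outside the allowed set becomes a space
cleaned = re.sub(r"[^a-zA-Z0-9.,!?']", " ", good)
print(cleaned.split())
```

After both steps, splitting on whitespace yields punctuation marks as separate tokens, and the `$` sign has been dropped.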
Lastly, still inside __init__, we set up the vocabulary list discussed earlier.
If the caller provides a vocabulary list, we use it as-is. Otherwise, we build one from the dataset.
if vocab is not None:
    self.vocab = vocab
else:
    self.vocab = self._make_vocab(max_vocab_size=max_vocab_size)
Dataset method overrides and a vocabulary helper:
- __len__: Returns the length of the samples.
- __getitem__: A special method in PyTorch's Dataset class allowing retrieval of a specific sample from the dataset using its index.
- _make_vocab: Returns the vocabulary list from the dataset.
def __len__(self):
    return len(self.samples)

def __getitem__(self, index):
    sample = self.samples[index]
    # Map each word to its vocabulary index
    sample = [self.get_index(word) for word in sample.split()]
    # Truncate, then pad up to max_seq_len
    sample = sample[:self.max_seq_len]
    pad_len = self.max_seq_len - len(sample)
    sample += [self.get_index("[PAD]")] * pad_len
    label = self.labels[index]
    return sample, label

def _make_vocab(self, max_vocab_size=30000):
    # Very large counts keep the special tokens at the front after sorting
    vocab = {"[PAD]": 1000000000000001, "[UNK]": 100000000000000}
    for sample in self.samples:
        for word in sample.split():
            if word not in vocab:
                vocab[word] = 1
            else:
                vocab[word] += 1
    # Sort by frequency (descending) and keep the most common words
    vocab = dict(sorted(vocab.items(), key=lambda item: item[1], reverse=True))
    vocab = list(vocab.keys())[:max_vocab_size]
    return vocab
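As an alternative sketch, the same frequency-sorted vocabulary can be built with collections.Counter from the standard library. This is a swapped-in technique rather than the tutorial's own code; the standalone function name `make_vocab` and the sample sentences are hypothetical.

```python
from collections import Counter


def make_vocab(samples, max_vocab_size=30000):
    # Count every whitespace-separated word across all samples
    counts = Counter(word for sample in samples for word in sample.split())
    # Reserve the first two slots for the special tokens, mirroring the
    # very large counts that _make_vocab assigns to them
    specials = ["[PAD]", "[UNK]"]
    words = [w for w, _ in counts.most_common() if w not in specials]
    return (specials + words)[:max_vocab_size]


samples = ["gas fees are high", "gas is cheap", "fees are fees"]
vocab = make_vocab(samples, max_vocab_size=5)
print(vocab)  # ['[PAD]', '[UNK]', 'fees', 'gas', 'are']
```

Words with equal counts keep their first-seen order, so "gas" comes before "are" here.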
Custom helper functions:
- get_vocab: Returns the list of vocab.
- get_index: Returns the index of a word.
def get_vocab(self):
    return self.vocab

def get_index(self, word):
    if word in self.vocab:
        index = self.vocab.index(word)
    else:
        index = self.vocab.index("[UNK]")
    return index
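Note that list.index scans the vocabulary list on every lookup, which can become slow with tens of thousands of words. One possible optimisation, not part of the tutorial's code, is to precompute a word-to-index dictionary once. The names `word_to_index` and `unk_index` below are illustrative.

```python
vocab = ["[PAD]", "[UNK]", "fees", "gas", "are"]  # toy vocabulary

# Build the mapping once; each lookup is then a single dict access
word_to_index = {word: i for i, word in enumerate(vocab)}
unk_index = word_to_index["[UNK]"]


def get_index(word):
    # Unknown words fall back to the [UNK] index
    return word_to_index.get(word, unk_index)


print([get_index(w) for w in ["gas", "fees", "airdrop"]])  # [3, 2, 1]
```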
Collate Function
The collate method is a utility function commonly used with PyTorch's DataLoader to determine how individual data points (samples) are combined into batches for training or evaluation.
This function takes one parameter, batch, which is a list of data points (samples).
Next, we unzip the batch using list(zip(*batch)). We expect two results from this operation: input_ids and targets.
- input_ids is a tuple of lists, where each inner list holds the word indices of one data point.
- targets is a tuple of labels.
def collate(self, batch):
    input_ids, targets = list(zip(*batch))
Then, we convert the data into tensors using torch.tensor(), turning the lists of word indices and labels into tensors that the model can consume and that can later be moved to the GPU. This will be the return value of the collate function.
return torch.tensor(input_ids), torch.tensor(targets, dtype=torch.float32)
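The unzip step is easy to try in plain Python. The batch below is made up for illustration, with each item being a (sample, label) pair like the one returned by __getitem__.

```python
# A made-up batch of two (sample, label) pairs
batch = [([5, 2, 0, 0], 1), ([7, 3, 4, 0], 0)]

# zip(*batch) regroups the pairs: all samples together, all labels together
input_ids, targets = list(zip(*batch))

print(input_ids)  # ([5, 2, 0, 0], [7, 3, 4, 0])
print(targets)    # (1, 0)
```

Because every sample was already padded to the same length, the regrouped samples form a rectangular structure that torch.tensor() can convert directly.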
get_loader Function
- Operation: Returns a DataLoader allowing iteration over the dataset in batches.
- Parameters: Accepts the dataset and batch size as parameters.
from torch.utils.data import DataLoader

def get_loader(dataset_df, batch_size):
    return DataLoader(dataset_df, batch_size=batch_size, num_workers=1,
                      collate_fn=dataset_df.collate)
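To see the loader in action, here is a minimal sketch with a hypothetical `ToyDataset` standing in for the tutorial's dataset class. The samples are assumed to be already indexed and padded, and num_workers is set to 0 here only to keep this small demo in the main process.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Hypothetical stand-in for the tutorial's dataset class."""

    def __init__(self):
        # Already indexed and padded samples with binary labels
        self.samples = [[5, 2, 0, 0], [7, 3, 4, 0], [1, 1, 6, 2]]
        self.labels = [1, 0, 1]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        return self.samples[index], self.labels[index]

    def collate(self, batch):
        input_ids, targets = list(zip(*batch))
        return torch.tensor(input_ids), torch.tensor(targets, dtype=torch.float32)


def get_loader(dataset, batch_size):
    # num_workers=0 is a choice made for this demo, not the tutorial's setting
    return DataLoader(dataset, batch_size=batch_size, num_workers=0,
                      collate_fn=dataset.collate)


loader = get_loader(ToyDataset(), batch_size=2)
batches = list(loader)
shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in batches]
print(shapes)  # [((2, 4), (2,)), ((1, 4), (1,))]
```

With three samples and a batch size of 2, the loader yields one full batch and one smaller final batch.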
Congratulations on completing the first part of the tutorial! Now, you can move on to Part 2. Let’s continue our conversation there.
Reach out to us:
Website: https://flock.io/
Twitter: https://twitter.com/flock_io
Telegram: https://t.me/flock_io_community
Discord: https://discord.gg/ay8MnJCg2W