Negative Correlation

Reputation: 983

Combining text and tabular data in PyTorch for classification model

I have a dataset that consists of customers and their product purchases for an e-commerce company that sells clothes. Along with this data, I have application logs that show the customers' interactions on the site. The data looks something like this:

import pandas as pd
data = {'customer_id':[369799, 103508, 294535, 222573, 204286, 254953, 268167, 56201, 168900, 96618],
 'application_log':['web_pdp__click_main_banner web_pdp__click_prod',
 'web_pdp__click_main_banner web_pdp__click_prod web_pdp__click_sub',
 'web_home__click_main_banner web_home__click_prod',
 'web_pdp__click_main_banner web_pdp__click_prod web_pdp__click_sub web_pdp__click_sub web_pdp__click_sub web_pdp__click_sub web_pdp__click_sub',
 'web_pdp__click_main_banner web_pdp__click_prod web_pdp__view_hero web_pdp__hover_index web_pdp__click_sub',
 'web_pdp__click_main_banner web_pdp__click_prod web_pdp__click_sub web_pdp__click_sub web_pdp__click_sub',
 'web_pdp__click_main_banner web_pdp__click_prod web_pdp__click_sub web_pdp__click_sub web_pdp__click_sub web_pdp__click_sub web_pdp__click_sub web_pdp__click_sub',
 'web_pdp__click_main_banner web_pdp__click_prod web_pdp__click_sub web_pdp__click_sub web_pdp__click_sub web_pdp__click_sub web_pdp__click_sub web_pdp__click_sub web_pdp__click_sub',
 'web_pdp__click_main_banner web_pdp__click_prod web_pdp__click_sub',
 'web_home__click_main_banner web_home__click_prod'],
 'var_1':[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'var_2':[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'var_3':[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'var_4':[0, 1, 0, 5, 1, 3, 6, 7, 1, 0],
 'var_5':[0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
 'targets':[1, 1, 0, 1, 1, 1, 1, 1, 1, 1]}
data = pd.DataFrame(data)

out:

customer_id application_log var_1 var_2 var_3 var_4 var_5 targets
369799 web_pdp__click_main_banner web_pdp__click_prod 0 0 1 0 0 1
103508 web_pdp__click_main_banner web_pdp__click_prod web_pdp__click_sub 0 0 1 1 0 1
294535 web_home__click_main_banner web_home__click_prod 0 0 1 0 0 0
222573 web_pdp__click_main_banner web_pdp__click_prod web_pdp__click_sub web_pdp__click_sub web_pdp__click_sub web_pdp__click_sub web_pdp__click_sub 0 0 1 5 0 1
204286 web_pdp__click_main_banner web_pdp__click_prod web_pdp__view_hero web_pdp__hover_index web_pdp__click_sub 0 0 1 1 1 1
254953 web_pdp__click_main_banner web_pdp__click_prod web_pdp__click_sub web_pdp__click_sub web_pdp__click_sub 0 0 1 3 0 1
268167 web_pdp__click_main_banner web_pdp__click_prod web_pdp__click_sub web_pdp__click_sub web_pdp__click_sub web_pdp__click_sub web_pdp__click_sub web_pdp__click_sub 0 0 1 6 0 1
56201 web_pdp__click_main_banner web_pdp__click_prod web_pdp__click_sub web_pdp__click_sub web_pdp__click_sub web_pdp__click_sub web_pdp__click_sub web_pdp__click_sub web_pdp__click_sub 0 0 1 7 0 1
168900 web_pdp__click_main_banner web_pdp__click_prod web_pdp__click_sub 0 0 1 1 0 1
96618 web_home__click_main_banner web_home__click_prod 0 0 1 0 0 1

I want to predict the probability of a customer making a subsequent purchase, as denoted in the field “targets” above. I would like to do this in PyTorch using a “customer-as-a-text” paradigm, whereby the customer’s session logs are split into discrete tokens and grouped into “customer-sentences”, which are then used to learn “customer-embeddings”, similar to the approach illustrated in the diagram from this article:


https://blog.griddynamics.com/customer2vec-representation-learning-and-automl-for-customer-analytics-and-personalization/

The diagram comes from the tutorial I am trying to emulate, which mentions using doc2vec to generate the embeddings and then concatenating those embeddings with the remaining tabular data. The problem is that I don’t quite understand how to do this in PyTorch. Specifically, I don’t know how to create the model or the custom dataset that PyTorch requires.
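For reference, my rough understanding of the article's doc2vec step, sketched with gensim (untested on my part; the hyperparameters and variable names are just placeholders), is the following, but it is the PyTorch side I am stuck on:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# one "customer-sentence" per customer, tagged with the customer_id
documents = [
    TaggedDocument(words=row.application_log.split(), tags=[str(row.customer_id)])
    for row in data.itertuples()
]

# untested sketch; vector_size/window/epochs are placeholder values
d2v = Doc2Vec(documents, vector_size=32, window=5, min_count=1, epochs=40)

# per-customer embedding that would then be concatenated with var_1..var_5
customer_vector = d2v.dv[str(369799)]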

I would very much appreciate it if someone could provide me with code that takes this data and converts it into a PyTorch dataset, as well as code for a multi-modal model that uses an LSTM layer for the text features and then combines the text features with the remaining numerical features in a multi-layer perceptron to predict the probability of the target column.
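To be concrete, this is roughly the shape of model I have in mind (an untested sketch; the layer sizes, vocabulary size and class names are placeholders):

import torch
import torch.nn as nn

class CustomerModel(nn.Module):
    def __init__(self, vocab_size, n_tabular, embed_dim=32, hidden_dim=64, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # MLP over the LSTM summary concatenated with the tabular features
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim + n_tabular, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, tokens, tabular):
        embedded = self.embedding(tokens)           # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)           # h_n: (1, batch, hidden_dim)
        text_features = h_n[-1]                     # final hidden state per sample
        combined = torch.cat([text_features, tabular.float()], dim=1)
        return torch.sigmoid(self.mlp(combined))    # probability of purchase

But I am not sure this is the right way to wire the two parts together.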

I found one specific tutorial that does this, except it uses PyTorch Lightning, which is something I want to avoid (https://drivendata.co/blog/hateful-memes-benchmark).

I haven't written any code for the actual model yet. My current code for the dataset looks something like the following, but I feel that I'm going about it all wrong, especially in terms of the text data:

import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence


class Vocabulary:
    """Builds index-to-token (itos) and token-to-index (stoi) mappings from a corpus."""

    def __init__(self, freq_threshold, max_size):
        """
        freq_threshold : minimum number of times a token must occur in the corpus to be kept
        max_size       : maximum vocabulary size
        """
        # index-to-token dict, pre-filled with the special tokens
        self.itos = {0: "<PAD>", 1: "<SOS>", 2: "<EOS>", 3: "<UNK>"}
        # token-to-index dict (inverse of itos)
        self.stoi = {k: j for j, k in self.itos.items()}

        self.freq_threshold = freq_threshold
        self.max_size = max_size

    def __len__(self):
        return len(self.itos)

    @staticmethod
    def tokenizer(text):
        # simple tokenizer: split on spaces and lowercase each token
        return [tok.lower().strip() for tok in text.split(" ")]

    def build_vocabulary(self, sentence_list):
        """Build itos/stoi, dropping tokens below freq_threshold and capping at max_size."""
        frequencies = {}
        idx = 4  # indices 0-3 are reserved for <PAD>, <SOS>, <EOS>, <UNK>

        # count token frequencies
        for sentence in sentence_list:
            for word in self.tokenizer(sentence):
                frequencies[word] = frequencies.get(word, 0) + 1

        # drop low-frequency tokens
        frequencies = {k: v for k, v in frequencies.items() if v > self.freq_threshold}

        # cap the vocab at max_size (minus the 4 special tokens)
        if len(frequencies) > self.max_size - idx:
            frequencies = dict(
                sorted(frequencies.items(), key=lambda x: -x[1])[: self.max_size - idx]
            )

        # assign indices to the remaining tokens
        for word in frequencies.keys():
            self.stoi[word] = idx
            self.itos[idx] = word
            idx += 1

    def numericalize(self, text):
        """Convert a sentence into a list of token indices; out-of-vocab tokens map to <UNK>."""
        tokenized_text = self.tokenizer(text)
        return [self.stoi.get(token, self.stoi["<UNK>"]) for token in tokenized_text]


class MyDataset(Dataset):
    def __init__(self, df, target, text):
        # tabular features: everything except the target and the text column
        # (note this still includes customer_id)
        x = df.drop([target, text], axis=1).values.astype(int)
        self.x_text = df[text].reset_index(drop=True)
        y = df[target].values.astype(int)

        self.x_text_voc = Vocabulary(1, 100)
        self.x_text_voc.build_vocabulary(self.x_text.tolist())

        self.x_train = torch.tensor(x, dtype=torch.int64)
        self.y_train = torch.tensor(y, dtype=torch.int64)

    def __len__(self):
        return len(self.y_train)

    def __getitem__(self, idx):
        # numericalize the log string and wrap it in <SOS> ... <EOS>
        text_vector = self.x_text.iloc[idx]
        num_source = [self.x_text_voc.stoi["<SOS>"]]
        num_source += self.x_text_voc.numericalize(text_vector)
        num_source.append(self.x_text_voc.stoi["<EOS>"])

        return self.x_train[idx], torch.tensor(num_source), self.y_train[idx]


class MyCollate:
    def __init__(self, pad_idx):
        self.pad_idx = pad_idx

    def __call__(self, batch):
        # stack the fixed-size tabular features
        tabular = torch.stack([item[0] for item in batch])
        # pad the variable-length token sequences to the longest in the batch
        text = pad_sequence(
            [item[1] for item in batch], batch_first=True, padding_value=self.pad_idx
        )
        # stack the labels
        labels = torch.stack([item[2] for item in batch])
        return tabular, text, labels


def get_train_loader(dataset, batch_size, num_workers=0, shuffle=True, pin_memory=False):
    # get pad_idx for the collate fn
    pad_idx = dataset.x_text_voc.stoi["<PAD>"]
    loader = DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        shuffle=shuffle,
        pin_memory=pin_memory,
        collate_fn=MyCollate(pad_idx=pad_idx),
    )
    return loader


train = MyDataset(data, "targets", "application_log")
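I would then build the loader along these lines (just to show how I intend to consume the batches; the batch size is arbitrary):

loader = get_train_loader(train, batch_size=4)
tabular_batch, text_batch, label_batch = next(iter(loader))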

Upvotes: 0

Views: 690

Answers (1)

Warkaz

Reputation: 913

I think grouping the events together into one string doesn't yield any benefit, especially since the domain of the application-log strings is already fixed by the other app. I would suggest using a bag-of-words style encoding to transform them into numbers: essentially, web_pdp__click_main_banner translates to 0, the next event to 1, and so on. For more information, I encourage you to check examples here. You would put that code inside your example and produce numerical values for your model to learn from.
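A minimal sketch of that mapping (plain Python over the data frame from your question; the names are just placeholders):

event_to_id = {}
for log in data["application_log"]:
    for event in log.split():
        if event not in event_to_id:
            event_to_id[event] = len(event_to_id)

# e.g. 'web_pdp__click_main_banner' -> 0, 'web_pdp__click_prod' -> 1, ...
data["log_ids"] = data["application_log"].apply(
    lambda log: [event_to_id[e] for e in log.split()]
)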

This way you can view your logs as a sequence of events (truncated or not) and either use an RNN, or cap the input length and use a fixed-size network padded with <EOS> (end-of-sequence) tokens. Either way, you don't have to train your own word2vec or doc2vec.

But even if you do want to train one, you would do it only on the string data; you would then concatenate the output of the embedding with the other features shown in your table.
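Roughly how that concatenation could look, building on the event_to_id mapping sketched above (a sketch only; the embedding dimension and mean-pooling are arbitrary choices):

import torch
import torch.nn as nn

# embed the event-ID sequence, pool it, and concatenate with the tabular features
embedding = nn.Embedding(num_embeddings=len(event_to_id), embedding_dim=16)

ids = torch.tensor(data["log_ids"][0])            # one customer's event sequence
log_vector = embedding(ids).mean(dim=0)           # (16,) pooled "customer embedding"
tabular = torch.tensor(
    data[["var_1", "var_2", "var_3", "var_4", "var_5"]].values[0],
    dtype=torch.float32,
)                                                 # (5,)
features = torch.cat([log_vector, tabular])       # (21,) input to the rest of the model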

Upvotes: -1
