Reputation: 826
We receive abuse complaints from third parties. I've exported a large batch of the complaints from XML and compiled them into a pandas DataFrame, scrubbing out identifying details such as email addresses, hostnames, URLs, and IP addresses along the way, like the following.
The file 'learning_data.txt' consists of thousands of lines each looking like this:
<label>:<a long string of text>
Script so far
#!/usr/bin/env python
import pandas as pd

def main():
    print('Loading data...')
    with open('learning_data.txt') as f:
        data = f.readlines()
    labels, texts = [], []
    for line in data:
        # Split only on the first ':' so colons inside the text survive.
        label, text = line.split(':', 1)
        labels.append(label)
        texts.append(text.strip())
    print('Adding to pandas DataFrame')
    trainDF = pd.DataFrame()
    trainDF['label'] = labels
    trainDF['text'] = texts
    print(trainDF)

if __name__ == '__main__':
    main()
Output:
label text
8 Attacks and Reconnaissance__SSH Brute Force Abuse from ... Dear Administrator, We have d...
9 Malicious Code/Traffic__Unknown - [ Vulnerable Host in Canada] In support of...
10 Fraud__Copyright/Trademark Infringement Unauthorized Use of Copyrights RE: TC--b--- *...
... ... ...
43635 Malicious Code/Traffic__Unknown tdss report about ... last detected -- :: Sec...
43636 Fraud__Phishing Issue : phishing attack at /// Dear Sir or Ma...
The labels use __ as a separator because I don't expect to do multiple classifications yet, if ever.
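If it helps, the two halves of that label can be pulled apart later with a plain string split on the separator; the label value below is just one of the examples from the output above:

```python
# Sketch: split a stored label on the '__' separator to recover the
# top-level category and the subcategory independently.
label = "Malicious Code/Traffic__Unknown"
top, _, sub = label.partition("__")
print(top)  # Malicious Code/Traffic
print(sub)  # Unknown
```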
All of the demos I've seen for machine learning and text classification use some black-box data source like the 20 newsgroups, etc. Since I'm starting with my own data, I'm having trouble fitting it into the examples/tutorials.
Edit: I'm using Python 3.6.6
Where do I go from here?
Should I be using sklearn or some other library? PyTorch? How do I turn the text into features and associate them with a label? How do I save the trained model so that another script can use it to predict labels for new text?
I'm starting from scratch here with machine learning but I've done tons of stuff in Python unrelated to machine learning.
Upvotes: 1
Views: 1088
Reputation: 2982
You can use scikit-learn's CountVectorizer or TfidfVectorizer. Here's a rough outline of an approach:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

text = ['text1', ...]      # your trainDF['text'] values
targets = ['abuse', ...]   # your trainDF['label'] values

# Turn the raw text into a sparse bag-of-words matrix.
count_vect = CountVectorizer()
matrix = count_vect.fit_transform(text)

# Encode the string labels as integers.
encoder = LabelEncoder()
targets = encoder.fit_transform(targets)

randomForest = RandomForestClassifier()
randomForest.fit(matrix, targets)
Upvotes: 1