Reputation: 826
We receive abuse complaints from third parties. I've exported a large batch of the complaints from XML and compiled them into a pandas DataFrame, scrubbing out identifying details such as email addresses, hostnames, URLs, and IP addresses along the way, like the following.
The file 'learning_data.txt' consists of thousands of lines each looking like this:
<label>:<a long string of text>
Script so far
#!/usr/bin/env python
import pandas as pd

def main():
    print('Loading data...')
    with open('learning_data.txt') as f:
        data = f.readlines()
    labels, texts = [], []
    for line in data:
        # Split only on the first ':' so colons inside the text survive.
        label, text = line.split(':', 1)
        labels.append(label)
        texts.append(text.strip())
    print('Adding to pandas DataFrame')
    trainDF = pd.DataFrame()
    trainDF['label'] = labels
    trainDF['text'] = texts
    print(trainDF)

if __name__ == '__main__':
    main()
Output:
label text
8 Attacks and Reconnaissance__SSH Brute Force Abuse from ... Dear Administrator, We have d...
9 Malicious Code/Traffic__Unknown - [ Vulnerable Host in Canada] In support of...
10 Fraud__Copyright/Trademark Infringement Unauthorized Use of Copyrights RE: TC--b--- *...
... ... ...
43635 Malicious Code/Traffic__Unknown tdss report about ... last detected -- :: Sec...
43636 Fraud__Phishing Issue : phishing attack at /// Dear Sir or Ma...
The labels use __ as a separator because I don't expect to do multiple classifications yet, if ever.
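If it helps, the two halves of that label can be pulled apart later with a plain string split on the separator; the label value below is just one of the examples from the output above:

```python
# Sketch: split a stored label on the '__' separator to recover the
# top-level category and the subcategory independently.
label = "Malicious Code/Traffic__Unknown"
top, _, sub = label.partition("__")
print(top)  # Malicious Code/Traffic
print(sub)  # Unknown
```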
All of the demos I've seen for machine learning and text classification use some black-box data source like the 20 newsgroups, etc. Since I'm starting with my own data, I'm having trouble fitting it into the examples/tutorials.
Edit: I'm using Python 3.6.6
Where do I go from here?
Should I be using sklearn or some other library? PyTorch? How do I turn the text into features and associate them with a label? How do I save the trained model so that another script can use it to predict labels for new text?
I'm starting from scratch here with machine learning but I've done tons of stuff in Python unrelated to machine learning.
Upvotes: 1
Views: 1088
Reputation: 2982
You can use scikit-learn's CountVectorizer or TfidfVectorizer. Here's a rough outline of an approach:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

text = ['text1', ...]      # your trainDF['text'] values
targets = ['abuse', ...]   # your trainDF['label'] values

# Turn the raw text into a sparse bag-of-words matrix.
count_vect = CountVectorizer()
matrix = count_vect.fit_transform(text)

# Encode the string labels as integers.
encoder = LabelEncoder()
targets = encoder.fit_transform(targets)

randomForest = RandomForestClassifier()
randomForest.fit(matrix, targets)
Upvotes: 1