Reputation: 414
I'm new to the mlpy library and looking for the best way to implement sentence classification. I was thinking of using the mlpy basic Perceptron, but from my understanding it uses a pre-defined vector size, and I need the vector size to grow dynamically while the machine is learning, because I don't want to create a huge vector of all English words. What I actually need to do is take a list of sentences, build a classifier from them, and then, when the application receives a new sentence, have it automatically classified under one of the labels (supervised learning).
Any ideas, thoughts, and examples would be very helpful.
Thanks
Upvotes: 1
Views: 205
Reputation: 2334
If you have all the sentences beforehand, you can prepare a list of words (removing stop words) and map every word to a feature index. The size of the feature vector would then be the number of words in the dictionary.
Once you have that, you can train a perceptron.
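As a rough illustration (plain Python, not tied to mlpy; the stop-word set and the two labelled sentences below are made-up placeholders), the word-to-index mapping could look like this:
# Build a word -> feature-index dictionary from labelled sentences and
# turn each sentence into a sparse {index: count} bag-of-words vector.
stop_words = {"a", "an", "the", "is", "to"}            # placeholder stop-word list
sentences = [("win money now", "spam"),                # placeholder training data
             ("meeting at noon", "notspam")]

positions = {}                                         # word -> feature index
examples = []                                          # list of (label, {index: count})

for text, label in sentences:
    counts = {}
    for word in text.lower().split():
        if word in stop_words:
            continue
        if word not in positions:
            positions[word] = len(positions) + 1       # assign the next free index
        idx = positions[word]
        counts[idx] = counts.get(idx, 0) + 1
    examples.append((1 if label == "spam" else -1, counts))

# len(positions) is the fixed vector size the perceptron will use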
For a fuller picture, have a look at my code below, in which I did the mapping in Perl followed by a perceptron implementation in MATLAB, to understand how it works, and then write a similar implementation in Python.
Preparing the bag of words model (Perl)
use warnings;
use strict;

my %positions = ();   # word -> feature index
my $n = 0;            # number of distinct words seen so far
my $spam = -1;

# Convert the training file: each line becomes "label index:frequency ..."
open (INFILE, "q4train.dat") or die "Cannot open q4train.dat: $!";
open (OUTFILE, ">q4train_mod.dat") or die "Cannot open q4train_mod.dat: $!";
while (<INFILE>) {
    chomp;
    my @values = split(' ', $_);
    my %frequencies = ();
    for (my $i = 0; $i < scalar(@values); $i = $i + 2) {
        if ($i == 0) {
            # the first pair holds the label
            if ($values[1] eq 'spam') {
                $spam = 1;
            }
            else {
                $spam = -1;
            }
        }
        else {
            $frequencies{$values[$i]} = $values[$i+1];
            # assign the next free index to a word we have not seen before
            if (!exists ($positions{$values[$i]})) {
                $n++;
                $positions{$values[$i]} = $n;
            }
        }
    }
    print OUTFILE $spam." ";
    my @keys = sort { $positions{$a} <=> $positions{$b} } keys %positions;
    foreach my $word (@keys) {
        if (exists ($frequencies{$word})) {
            print OUTFILE " ".$positions{$word}.":".$frequencies{$word};
        }
    }
    print OUTFILE "\n";
}
close (INFILE);
close (OUTFILE);

# Convert the test file using the same word positions
open (INFILE, "q4test.dat") or die "Cannot open q4test.dat: $!";
open (OUTFILE, ">q4test_mod.dat") or die "Cannot open q4test_mod.dat: $!";
while (<INFILE>) {
    chomp;
    my @values = split(' ', $_);
    my %frequencies = ();
    for (my $i = 0; $i < scalar(@values); $i = $i + 2) {
        if ($i == 0) {
            if ($values[1] eq 'spam') {
                $spam = 1;
            }
            else {
                $spam = -1;
            }
        }
        else {
            $frequencies{$values[$i]} = $values[$i+1];
            if (!exists ($positions{$values[$i]})) {
                $n++;
                $positions{$values[$i]} = $n;
            }
        }
    }
    print OUTFILE $spam." ";
    my @keys = sort { $positions{$a} <=> $positions{$b} } keys %positions;
    foreach my $word (@keys) {
        if (exists ($frequencies{$word})) {
            print OUTFILE " ".$positions{$word}.":".$frequencies{$word};
        }
    }
    print OUTFILE "\n";
}
close (INFILE);
close (OUTFILE);

# Dump the word list so feature indices can be traced back to words
open (OUTFILE, ">wordlist.dat") or die "Cannot open wordlist.dat: $!";
my @keys = sort { $positions{$a} <=> $positions{$b} } keys %positions;
foreach my $word (@keys) {
    print OUTFILE $word."\n";
}
close (OUTFILE);
Perceptron implementation (MATLAB)
clc; clear; close all;

% load the LIBSVM-formatted files produced by the Perl script
[Ytrain, Xtrain] = libsvmread('q4train_mod.dat');
[Ytest, Xtest] = libsvmread('q4test_mod.dat');

mtrain = size(Xtrain,1);
mtest = size(Xtest,1);
n = size(Xtrain,2);

% learn perceptron: prepend a bias column of ones
Xtrain_perceptron = [ones(mtrain,1) Xtrain];
Xtest_perceptron = [ones(mtest,1) Xtest];
alpha = 0.1;

% initialize
theta_perceptron = zeros(n+1,1);
trainerror_mag = 100000;
iteration = 0;

% loop until the squared training error falls below the threshold
while (trainerror_mag > 1000)
    iteration = iteration + 1;
    for i = 1 : mtrain
        Ypredict_temp = sign(theta_perceptron'*Xtrain_perceptron(i,:)');
        theta_perceptron = theta_perceptron + alpha*(Ytrain(i)-Ypredict_temp)*Xtrain_perceptron(i,:)';
    end
    Ytrainpredict_perceptron = sign(theta_perceptron'*Xtrain_perceptron')';
    trainerror_mag = (Ytrainpredict_perceptron - Ytrain)'*(Ytrainpredict_perceptron - Ytrain)
end

Ytestpredict_perceptron = sign(theta_perceptron'*Xtest_perceptron')';
testerror_mag = (Ytestpredict_perceptron - Ytest)'*(Ytestpredict_perceptron - Ytest)
I don't want to code the same thing in Python again, but this should give you a direction on how to proceed.
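For orientation only, here is a rough NumPy sketch of the same training loop; the names Xtrain/Ytrain are illustrative (dense arrays built from the bag-of-words features above), this is not mlpy's own API, and the stopping rule is simplified to "no training mistakes" rather than the error threshold used in the MATLAB code:
import numpy as np

def train_perceptron(X, y, alpha=0.1, max_iters=100):
    # X is an (m, n) array of bag-of-words counts, y holds +1/-1 labels.
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])      # prepend a bias column, as in the MATLAB code
    theta = np.zeros(n + 1)
    for _ in range(max_iters):
        for i in range(m):
            pred = np.sign(theta @ Xb[i])
            theta += alpha * (y[i] - pred) * Xb[i]
        if np.all(np.sign(Xb @ theta) == y):  # stop once every training example is classified correctly
            break
    return theta

def predict(theta, X):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.sign(Xb @ theta)

# theta = train_perceptron(Xtrain, Ytrain)
# Ytestpredict = predict(theta, Xtest)
If mlpy ships a perceptron classifier you would rather use, you can substitute it for train_perceptron, but check the library's documentation for its exact interface.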
Upvotes: 1