Muhammet Can

Reputation: 1354

Which algorithms/concepts should I dig into for author prediction?

I have been working on something that will try to figure out the author of a column by using my own data set.

I'm planning to use the mlpy Python library. It has good documentation (about 100 pages of PDF). I'm also open to other library suggestions.

The thing is, I'm lost in Data Mining and Machine Learning concepts. There is too much material on them: too many algorithms and too many concepts.

I'm asking for directions: which algorithms and concepts should I learn and look into for my specific problem?

So far, I've built a dataset that looks something like this:

| author | feature x | feature y | feature z | some more features |
|--------+-----------+-----------+-----------+--------------------|
| A      |         2 |         4 |         6 | ..                 |
| A      |         1 |         1 |         5 | ..                 |
| B      |        12 |        15 |         9 | ..                 |
| B      |        13 |        13 |        13 | ..                 |

Now, I'll get a new column and parse it. After that, I will have all the features for the column, and my aim is to figure out who its author is.

As I'm not an ML guy, I can only think of computing the distance between the new column's features and the features of every row, and picking the closest one. But I'm pretty sure this is not the way I should go.
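For concreteness, the nearest-row idea can be sketched in a few lines of plain Python. This is only an illustration: it uses the four rows from the table above, assumes Euclidean distance (one choice among many), and the names `training_rows`, `euclidean`, `nearest_author` and the new column's values are made up for the example.

```python
import math

# Toy rows copied from the table above: (author, [feature x, feature y, feature z]).
training_rows = [
    ("A", [2, 4, 6]),
    ("A", [1, 1, 5]),
    ("B", [12, 15, 9]),
    ("B", [13, 13, 13]),
]

def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_author(features, rows):
    """Return the author of the training row closest to `features`."""
    author, _ = min(rows, key=lambda row: euclidean(features, row[1]))
    return author

# Hypothetical feature values for a newly parsed column.
new_column = [11, 14, 10]
print(nearest_author(new_column, training_rows))
```

With features on very different scales, the largest one dominates the distance, so scaling the features first is usually worth considering.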

I'd appreciate any directions, links, readings etc.

Upvotes: 4

Views: 881

Answers (4)

Zhubarb

Reputation: 11895

Given that you are not familiar with ML, the first three algorithms I would recommend are:

  1. Logistic Regression
  2. Naive Bayes
  3. Support Vector Machines

If you are only interested in predictive performance, have enough training data and have no missing values, you will often find that more complex methods, such as Bayesian Networks, do not give statistically significant improvements in predictive performance. Even if they do, you should start with these three (relatively) simple methods and use them as reference benchmarks.
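To give a feel for how simple such a baseline can be, here is a minimal sketch of a Gaussian Naive Bayes classifier in plain Python. It assumes each feature is roughly normally distributed per author and independent of the others, uses the toy rows from the question, and adds a small `smoothing` constant to the variances (an assumption of this sketch, to avoid division by zero). All function names are illustrative; a library implementation would normally be preferable.

```python
import math
from collections import defaultdict

# Toy rows from the question: (author, [feature x, feature y, feature z]).
training_rows = [
    ("A", [2, 4, 6]),
    ("A", [1, 1, 5]),
    ("B", [12, 15, 9]),
    ("B", [13, 13, 13]),
]

def fit_gaussian_nb(rows, smoothing=1e-3):
    """Estimate a prior, per-feature means and per-feature variances for each author."""
    by_author = defaultdict(list)
    for author, features in rows:
        by_author[author].append(features)

    model = {}
    for author, vectors in by_author.items():
        n = len(vectors)
        columns = list(zip(*vectors))  # one tuple of values per feature
        means = [sum(col) / n for col in columns]
        # Population variance plus a small constant so it is never zero.
        variances = [
            sum((v - m) ** 2 for v in col) / n + smoothing
            for col, m in zip(columns, means)
        ]
        model[author] = (n / len(rows), means, variances)
    return model

def log_posterior(features, prior, means, variances):
    """Unnormalised log P(author | features), treating features as independent."""
    total = math.log(prior)
    for x, m, var in zip(features, means, variances):
        total += -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
    return total

def predict(features, model):
    """Pick the author with the highest log posterior."""
    return max(model, key=lambda author: log_posterior(features, *model[author]))

model = fit_gaussian_nb(training_rows)
print(predict([11, 14, 10], model))  # hypothetical new column
```

Whatever more elaborate method is tried later can then be compared against this kind of baseline on held-out columns.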

Upvotes: 1

petrichor

Reputation: 6569

If you have enough training data, then you can use a kNN (k-Nearest Neighbor) classifier for your purpose. It is easy to understand, yet powerful.

Check scikits.ann for a possible implementation.

This tutorial here serves as a good reference for the one in scikits-learn.

Edit: In addition, here is the page for kNN of scikits-learn. You can understand it easily from the given example.

And, mlpy also seems to have kNN.
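As a rough sketch of what kNN does under the hood, the following plain-Python version finds the `k` closest training rows and lets them vote. It assumes Euclidean distance via `math.dist` (Python 3.8+) and reuses the toy rows from the question; `knn_predict` is a name invented for this example, not a function from mlpy or scikit-learn.

```python
import math
from collections import Counter

# Toy rows from the question: (author, [feature x, feature y, feature z]).
training_rows = [
    ("A", [2, 4, 6]),
    ("A", [1, 1, 5]),
    ("B", [12, 15, 9]),
    ("B", [13, 13, 13]),
]

def knn_predict(features, rows, k=3):
    """Predict the author by majority vote among the k nearest training rows."""
    # Sort training rows by distance to the query and keep the k closest.
    neighbours = sorted(rows, key=lambda row: math.dist(features, row[1]))[:k]
    votes = Counter(author for author, _ in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical feature values for a newly parsed column.
print(knn_predict([11, 14, 10], training_rows, k=3))
```

An odd `k` helps avoid ties with two authors; with more authors, ties are still possible and a library implementation will have its own tie-breaking rule.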

Upvotes: 3

Upul Bandara

Reputation: 5958

As others mentioned, you can use a lot of algorithms for authorship attribution. kNN is a good starting point. Further, you can try several other algorithms, such as Logistic Regression, the Naïve Bayes classifier, and Neural Networks, which may give more accurate predictions.

I'm also interested in authorship attribution and plagiarism detection. In fact, I have used the above techniques for source code authorship attribution. You can read more about them in the following research papers.

  1. http://www.ijmlc.org/papers/50-A243.pdf [A Machine Learning Based Tool for Source Code Plagiarism Detection]
  2. http://dl.acm.org/citation.cfm?id=2423074 [Source code author identification with unsupervised feature learning]

Moreover, if you are planning to use Python, you can also look at the scikit-learn library (http://scikit-learn.org/stable/). It is also a comprehensive library and comes with good documentation.

Upvotes: 2

Pedrom

Reputation: 3823

You have a wide selection of algorithms implemented in mlpy, so you should be fine. I agree with Steve L that Support Vector Machines are great, but even though they are easy to use, the inner details are not easy to grasp, especially if you are new to ML.

In addition to kNN, you could consider Classification Trees (http://en.wikipedia.org/wiki/Decision_tree_learning) and Logistic Regression (http://en.wikipedia.org/wiki/Logistic_regression).

For starters, decision trees have the advantage of producing output that is easy to understand and hence easier to debug.
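To illustrate what "easy to understand" means here, the sketch below learns a one-level tree (a decision stump) in plain Python and prints it as a readable rule. It is a deliberately simplified stand-in for a full decision tree learner: it tries every midpoint between consecutive feature values and keeps the split with the fewest training errors. The data are the toy rows from the question, and `best_stump` is a hypothetical helper name.

```python
from collections import Counter

feature_names = ["feature x", "feature y", "feature z"]
# Toy rows from the question: (author, [feature x, feature y, feature z]).
training_rows = [
    ("A", [2, 4, 6]),
    ("A", [1, 1, 5]),
    ("B", [12, 15, 9]),
    ("B", [13, 13, 13]),
]

def best_stump(rows, names):
    """Find the single feature/threshold split with the fewest training errors."""
    best = None
    for index, name in enumerate(names):
        values = sorted({features[index] for _, features in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [a for a, f in rows if f[index] <= threshold]
            right = [a for a, f in rows if f[index] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = (sum(a != left_label for a in left)
                      + sum(a != right_label for a in right))
            # Keep the first split that strictly improves on the best so far.
            if best is None or errors < best[0]:
                best = (errors, name, threshold, left_label, right_label)
    return best

errors, name, threshold, left_label, right_label = best_stump(training_rows, feature_names)
print(f"if {name} <= {threshold}: predict {left_label}, else: predict {right_label}")
```

A real tree repeats this kind of split recursively on each side, but the result stays a set of if/else rules that can be read and checked by hand.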

Logistic Regression, on the other hand, can give you good results and scales very well if you need more data.

I would say that in your case, you should look for the algorithm that, after a bit of reading, you find most comfortable to work with. Most of the time, all of them are capable of giving you decent results. Good luck!

Upvotes: 2
