Coke

Reputation: 985

Regression analyses in Accord.NET

I am working on a school project and have a somewhat unusual task. My job is to scrape the posts from a certain Facebook page and feed them into a learning model that takes one input, a List, and produces an Int32 output.

First, let me briefly explain the steps I have already implemented:

  1. Scraped the data
  2. Stemmed it
  3. Removed capitalization, punctuation, emojis and spaces
  4. Merged words with the same root
  5. Counted occurrence of words and assigned count value to every word
  6. Performed tf-idf calculation to extract the weight of every word in every post

Now I have a Dictionary<String,List<double[],int>>, which represents

postId:[wordWeights],amountOfLikes as

23425234_35242352:[0.027,0.031,0.009,0.01233],89
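For illustration, the tf-idf step (step 6) could be sketched roughly as below. This is a minimal sketch, not the code used in the project: the sample posts, the variable names, and the choice of the plain `tf * ln(N / df)` variant are all assumptions. The point it shows is that weights are computed against one shared vocabulary, so every post ends up with a vector of the same length, which any regression model will need. It also assumes no post is empty.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data: each post is already stemmed and tokenized.
var posts = new Dictionary<string, string[]>
{
    ["post_1"] = new[] { "cat", "sit", "mat" },
    ["post_2"] = new[] { "cat", "eat", "fish" },
    ["post_3"] = new[] { "dog", "eat", "bone" },
};

// Shared, sorted vocabulary so every post gets a vector of the same length.
string[] vocabulary = posts.Values
    .SelectMany(tokens => tokens)
    .Distinct()
    .OrderBy(term => term, StringComparer.Ordinal)
    .ToArray();

// Document frequency: the number of posts that contain each word.
Dictionary<string, int> documentFrequency = vocabulary.ToDictionary(
    term => term,
    term => posts.Values.Count(tokens => tokens.Contains(term)));

// tf-idf = (count in post / post length) * ln(total posts / posts containing the word)
var weights = new Dictionary<string, double[]>();
foreach (var post in posts)
{
    string[] words = post.Value;
    weights[post.Key] = vocabulary
        .Select(term =>
        {
            double tf = words.Count(w => w == term) / (double)words.Length;
            double idf = Math.Log((double)posts.Count / documentFrequency[term]);
            return tf * idf;
        })
        .ToArray();
}

foreach (var entry in weights)
    Console.WriteLine($"{entry.Key}: [{string.Join(", ", entry.Value.Select(v => v.ToString("F3")))}]");
```

Words that do not occur in a post get a weight of 0, so the position of each weight in the array always refers to the same vocabulary word.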

I have to train my model on different posts and their like counts. For this purpose, I have chosen the Accord.NET library for C#, and so far I have looked at its SimpleLinearRegression class.

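First, I saw that I can use OrdinaryLeastSquares and feed it possible inputs and outputs like this:

using Accord.Statistics.Models.Regression.Linear;

// Word weights of a single post
double[] inputs = { 0.123, 0.23, 0.09 };
// Like count of that post, padded with zeros to match the input length
double[] outputs = { 98, 0, 0 };
OrdinaryLeastSquares ols = new OrdinaryLeastSquares();
SimpleLinearRegression regression = ols.Learn(inputs, outputs);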

As you can see, the number of inputs in the array has to match the number of outputs, so I padded the outputs with zeros. As a result, I got obviously wrong output. I cannot come up with a proper way of feeding my data to the linear regression class. I know that padding the array with zeros is wrong, but so far it is the only solution I have found. I would appreciate it if anyone could tell me how I should use regression in this case and help me choose a proper algorithm. Cheers!

Upvotes: 3

Views: 895

Answers (1)

Coke

Reputation: 985

After browsing the regression algorithms in Accord.NET, I settled on FanChenLinSupportVectorRegression, which is part of the Accord.NET machine learning library. As far as I can tell, it is named after Fan, Chen, and Lin, the three authors of the working-set selection method used in LIBSVM, rather than after a single person.

The algorithm is based on support vector regression (SVR), the regression variant of support vector machines (SVM).

The class is generic, FanChenLinSupportVectorRegression<TKernel>, and its Kernel property gets or sets the kernel function used to create a kernel support vector machine. If this property is set, UseKernelEstimation will be set to false.

The Learn method takes two arguments: first, an array of double arrays (in our case, the word weights of each post), and second, an array of doubles holding the number of likes for each post.

IMPORTANT: each sub-array of weights MUST line up with its like count in the second argument by index: the like count of the first sub-array sits at index [0] of the likes array, the like count of the second sub-array at index [1], and so on.
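One way to get from a per-post dictionary to two arrays aligned like this is sketched below. It is only an illustration under stated assumptions: the value-tuple type `(double[] Weights, int Likes)` stands in for the dictionary shape described in the question, and all post IDs and numbers except the first entry are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical data: post ID -> (tf-idf weights, number of likes).
// Only the first entry comes from the question; the others are invented.
var posts = new Dictionary<string, (double[] Weights, int Likes)>
{
    ["23425234_35242352"] = (new[] { 0.027, 0.031, 0.009, 0.01233 }, 89),
    ["post_2"] = (new[] { 0.012, 0.000, 0.044, 0.00310 }, 12),
    ["post_3"] = (new[] { 0.000, 0.052, 0.018, 0.02000 }, 40),
};

// Materialize the values once so both arrays are built from the same ordering.
var rows = posts.Values.ToList();

// inputs[i] and outputs[i] now describe the same post.
double[][] inputs = rows.Select(row => row.Weights).ToArray();
double[] outputs = rows.Select(row => (double)row.Likes).ToArray();

// Every weight vector must have the same length (one slot per vocabulary word).
int featureCount = inputs[0].Length;
if (inputs.Any(vector => vector.Length != featureCount))
    throw new InvalidOperationException("All weight vectors must have the same length.");

Console.WriteLine($"{inputs.Length} posts, {featureCount} features each");
```

Building both arrays from the same list avoids relying on the dictionary being enumerated in the same order twice.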

Example:

using Accord.MachineLearning.VectorMachines.Learning;
using Accord.Statistics.Kernels;

// Suppose these are posts with their tf-idf weights
double[][] inputs =
{
    new[] { 3.0, 1.0 },
    new[] { 7.0, 1.0 },
    new[] { 3.0, 1.0 },
    new[] { 3.0, 2.0 },
    new[] { 6.0, 1.0 },
};

// Number of likes each corresponding post received
double[] outputs = { 2.0, 3.0, 4.0, 11.0, 6.0 };

// Create the learning algorithm with a Gaussian kernel
var model = new FanChenLinSupportVectorRegression<Gaussian>();

// Train on the tf-idf weights of each post and the corresponding like counts
var svm = model.Learn(inputs, outputs);

// Score a sample tf-idf vector to get a predicted like count
double result = svm.Score(new[] { 2.0, 6.0 });

I tested this model with swapped inputs of the same value, and the results were reasonably accurate. The model also works well on larger inputs, although it requires more training. I hope this helps someone in the future.

Upvotes: 1
