sns

Reputation: 321

Best algorithm to predict 3 similar blogs based only on a blog's properties and content

{ "blogid": 11, "blog_authorid": 2, "blog_content": "(this is blog complete content: html encoded on base64 such as) PHNlY3Rpb24+PGRpdiBjbGFzcz0icm93Ij4KICAgICAgICA8ZGl2IGNsYXNzPSJjb2wtc20tMTIiIGRhdGEtdHlwZT0iY29udGFpbmVyLWNvbnRlbn", "blog_timestamp": "2018-03-17 00:00:00", "blog_title": "Amazon India Fashion Week: Autumn-", "blog_subtitle": "", "blog_featured_img_link": "link to image", "blog_intropara": "Introductory para to article", "blog_status": 1, "blog_lastupdated": "\"Mar 19, 2018 7:42:23 AM\"", "blog_type": "Blog", "blog_tags": "1,4,6", "blog_uri": "Amazon-India-Fashion-Week-Autumn", "blog_categories": "1", "blog_readtime": "5", "ViewsCount": 0 }

Above is one sample blog as per my API. I have a JsonArray of such blogs.

I am trying to predict 3 similar blogs based on a blog's properties (e.g. tags, categories, author, keywords in the title/subtitle) and content. I have no user data, i.e. there is no logged-in user data such as ratings or reviews. I know that without user data it will not be accurate, but I'm just getting started with data science and ML. Any suggestion or link is appreciated. I prefer Java, but Python, PHP or any other language also works for me. I need an easy-to-implement model as I am a beginner. Thanks in advance.

Upvotes: 0

Views: 58

Answers (1)

AChervony

Reputation: 663

My intuition is that this question might not be at the right address.

BUT

I would do the following:

  1. Create a dataset of sites that will serve as the inventory to predict from. For each site you will need to list one or more features: number of tags, number of posts, average time between posts in days, etc.
    It sounds like this is for training and you are not too worried about accuracy, so numeric features should suffice.
  2. Work back from a k-NN algorithm. Don't worry about the classification part. Instead of classifying a blog, you list its 3 closest neighbors (k = 3). A good implementation of the algorithm is here. Have fun simplifying it for your purposes.

Your algorithm should be a step or two shorter than k-NN, which is considered one of the simpler ML algorithms and a good place to start.
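A minimal sketch of that idea, assuming each blog is a dict shaped like the sample in the question. The feature choice (tag count, `blog_readtime`, `ViewsCount`) and the helper names `numeric_features` and `nearest_blogs` are illustrative assumptions, not part of any library. There is no classification step: the function only ranks the other blogs by Euclidean distance and returns the 3 closest.

```python
import math

def numeric_features(blog):
    """Turn one blog dict into a numeric vector (the feature choice is an assumption)."""
    tags = [t for t in blog["blog_tags"].split(",") if t.strip()]
    return [
        float(len(tags)),               # number of tags
        float(blog["blog_readtime"]),   # read time, stored as a string in the API
        float(blog["ViewsCount"]),      # view count
    ]

def nearest_blogs(target, blogs, k=3):
    """Return the ids of the k blogs closest to `target` by Euclidean distance."""
    target_vec = numeric_features(target)
    scored = []
    for blog in blogs:
        if blog["blogid"] == target["blogid"]:
            continue  # a blog is not its own neighbor
        distance = math.dist(target_vec, numeric_features(blog))
        scored.append((distance, blog["blogid"]))
    scored.sort()
    return [blogid for _, blogid in scored[:k]]

# Hypothetical inventory; only the fields used above are filled in.
blogs = [
    {"blogid": 11, "blog_tags": "1,4,6", "blog_readtime": "5", "ViewsCount": 0},
    {"blogid": 12, "blog_tags": "1,4", "blog_readtime": "5", "ViewsCount": 0},
    {"blogid": 13, "blog_tags": "2", "blog_readtime": "12", "ViewsCount": 0},
    {"blogid": 14, "blog_tags": "1,4,6,7", "blog_readtime": "6", "ViewsCount": 0},
    {"blogid": 15, "blog_tags": "3,5", "blog_readtime": "3", "ViewsCount": 0},
]
print(nearest_blogs(blogs[0], blogs))  # [12, 14, 15]
```

With features on very different scales (view counts in the thousands next to read times in minutes), the largest one dominates the distance, so some normalization would probably be needed on real data.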

Good luck.

EDIT:

You want to build a recommender engine using text, tags, numeric and maybe time-series data. This is a broad request. Just like you, when faced with this request, I’d need to dive into the data and research the best approach. Different approaches require different sets of data, e.g. collaborative vs. content-based filtering.

  • A few things on the user side may have been missed that could serve as a kind of rating. You do not need a login feature to get information: a cookie ID, IP-based DMA and geo data, and viewing duration should all be available to the web server.
  • On the blog side: you need to process the texts to identify related terms. I gave examples of other blog features above.
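As a starting point for the text processing, here is a rough sketch that turns `blog_content` into a list of lowercase tokens. It assumes the field holds complete base64-encoded UTF-8 HTML, as the question describes; the helper name `content_tokens` and the sample HTML are made up for illustration. It uses only the Python standard library.

```python
import base64
import re
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects the text between HTML tags and ignores the tags themselves."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def content_tokens(blog_content_b64):
    """Decode base64 HTML, strip the markup, and return lowercase word tokens."""
    html = base64.b64decode(blog_content_b64).decode("utf-8", errors="replace")
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    # Simple ASCII tokenization; non-English content would need something better.
    return re.findall(r"[a-z0-9]+", text)

# Hypothetical content, encoded here so the example is self-contained.
sample = base64.b64encode(
    b'<section><div class="row"><p>Fashion Week in Autumn</p></div></section>'
).decode("ascii")
print(content_tokens(sample))  # ['fashion', 'week', 'in', 'autumn']
```

Common words such as "in" carry little signal, so a stop-word list or a TF-IDF weighting would usually be applied on top of these tokens.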

I am aware that this is a lot of hand-waving, but there’s no actual code question here. To reiterate, my intuition is that this question might not be at the right address. I really want to help, but this is the best I can do.

EDIT 2:

If I understand your new comments correctly, each blog now has the following with respect to every other blog:

  • A Jaccard similarity coefficient.
  • A set of TF-IDF generated words with scores.
  • A Euclidean distance based on numeric data.
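For reference, the first of these statistics can be sketched in a few lines. This assumes the coefficient is computed on the comma-separated `blog_tags` field from the question; the helper names `tag_set` and `jaccard` are illustrative.

```python
def tag_set(blog):
    """Parse the comma-separated tag ids of a blog into a set of strings."""
    return {t.strip() for t in blog["blog_tags"].split(",") if t.strip()}

def jaccard(a, b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)

# Hypothetical blogs; only the tag field matters for this statistic.
blog_a = {"blogid": 11, "blog_tags": "1,4,6"}
blog_b = {"blogid": 12, "blog_tags": "1,4,7"}
print(jaccard(tag_set(blog_a), tag_set(blog_b)))  # 2 shared / 4 distinct = 0.5
```

The same function works on any pair of sets, so it could also be applied to `blog_categories`.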

I would combine these into a single heuristic score and let the process adjust the importance (weight) of each statistic.
The challenge is quantifying the word-score output of TF-IDF. You can treat the words above a certain score as tags and run another similarity analysis on them, or simply count the overlap.
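One way this could look, as a sketch under several assumptions: TF-IDF is the plain variant (term frequency as count over document length, IDF as the natural log of N over document frequency), the threshold and the weights are arbitrary placeholders to tune, and a distance d is turned into a similarity with 1 / (1 + d). The function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_keywords(docs, threshold):
    """docs maps blogid -> token list. Returns blogid -> set of terms scoring above threshold."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document
    keywords = {}
    for blogid, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens) or 1
        keywords[blogid] = {
            term for term, count in counts.items()
            if (count / total) * math.log(n_docs / doc_freq[term]) > threshold
        }
    return keywords

def overlap(a, b):
    """Jaccard overlap of two keyword sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def combined_score(tag_sim, keyword_sim, numeric_dist, weights=(0.4, 0.4, 0.2)):
    """Weighted sum of the three statistics; the weights are placeholders to adjust."""
    numeric_sim = 1.0 / (1.0 + numeric_dist)  # assumed distance-to-similarity mapping
    w_tag, w_kw, w_num = weights
    return w_tag * tag_sim + w_kw * keyword_sim + w_num * numeric_sim

# Hypothetical tokenized blog contents.
docs = {
    11: ["fashion", "week", "autumn", "fashion"],
    12: ["fashion", "week", "spring"],
    13: ["cricket", "match", "week"],
}
kw = tfidf_keywords(docs, threshold=0.1)
print(kw[11], kw[12])        # keyword sets treated as tags
print(overlap(kw[11], kw[12]))
print(combined_score(tag_sim=0.5, keyword_sim=overlap(kw[11], kw[12]), numeric_dist=1.0))
```

In this toy data "week" appears in every document, so its IDF is zero and it never becomes a keyword, which is the filtering effect the threshold is meant to have. Ranking all other blogs by `combined_score` and keeping the top 3 would give the recommendations.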

You have already started down this path, and this answer assumes you will continue. IMO the best path is to see which dedicated recommender engines can help you, rather than constructing the statistics piecemeal (numeric with Euclidean, tags with Jaccard, text with TF-IDF).

Upvotes: 1
