Mo J. Mughrabi

Reputation: 6997

Guess tags of a paragraph programmatically using Python

I've been trying to read about NLP in general, and NLTK in particular, to use with Python. I don't know for sure whether what I'm looking for already exists or whether I need to develop it myself.

I have a program that collects text from different files. The text is extremely random and covers many different subjects. Each file contains one paragraph, or three at most; my program opens the files and stores their contents in a table.

My question is: can I guess tags describing what each paragraph is about? If anyone knows of an existing technology or approach, I would really appreciate it.

Thanks,

Upvotes: 2

Views: 395

Answers (2)

alexis

Reputation: 50190

Your task is called "document classification", and the NLTK book has a whole chapter on it. I'd start with that.

It all depends on your criteria for assigning tags. Are you interested in matching your documents against a pre-existing set of tags, or perhaps in topic extraction (select the N most important words or phrases in the text)?
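To make the second option concrete, here is a minimal sketch of naive topic extraction that simply picks the N most frequent non-trivial words in a paragraph. It uses only the standard library; the function name `top_keywords` and the tiny `STOPWORDS` set are made up for illustration, and a real stop-word list would be much longer.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; extend it for real use.
STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "in",
    "is", "it", "of", "on", "or", "that", "the", "this", "to", "was", "with",
}

def top_keywords(text, n=3):
    """Return the n most frequent words in text, ignoring stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(n)]

paragraph = (
    "Python is great. Python programmers love Python libraries, "
    "and libraries help programmers."
)
print(top_keywords(paragraph, n=3))
```

Raw frequency is a crude measure of importance, so treat the result as candidate tags rather than a final answer.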

Upvotes: 1

luke14free

Reputation: 2539

You should train a classifier. The easiest one to use is naive Bayes, and you don't really need to develop it yourself because NLTK provides one. The catch is that you first have to label a corpus of paragraphs by hand; the program then learns from those examples and guesses which tag best fits a new paragraph. Needless to say, the bigger the training corpus, the more accurate the classifier will be; IMHO you can reach 80-85% accuracy. Take a look at the docs.
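A rough sketch of what that could look like with NLTK's `NaiveBayesClassifier`. The training sentences, the two tags, and the helper names `features` and `guess_tag` are invented for illustration; a real corpus would need far more labeled paragraphs. Splitting on whitespace is an assumption made to keep the example free of extra NLTK data downloads.

```python
from nltk.classify import NaiveBayesClassifier

def features(text):
    """Bag-of-words features: each lowercase word present maps to True."""
    return {word: True for word in text.lower().split()}

# Hypothetical hand-labeled corpus of (paragraph, tag) pairs.
LABELED = [
    ("the team won the football match", "sports"),
    ("the striker scored a goal in the match", "sports"),
    ("the coach praised the team after the game", "sports"),
    ("bake the cake in the oven", "cooking"),
    ("stir the sauce and add salt", "cooking"),
    ("chop the onions and fry them in butter", "cooking"),
]

# Train on (feature dict, tag) pairs, the input format NLTK expects.
classifier = NaiveBayesClassifier.train(
    [(features(text), tag) for text, tag in LABELED]
)

def guess_tag(text):
    """Return the tag the classifier considers most likely for text."""
    return classifier.classify(features(text))

print(guess_tag("the team scored a late goal"))
print(guess_tag("add butter and salt to the sauce"))
```

To estimate how well this works on your own data, hold back part of the labeled corpus and check it with `nltk.classify.accuracy(classifier, test_set)`.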

Upvotes: 0
