Arshdeep

Reputation: 4323

How to find keywords (useful words) from text?

I am doing an experimental project.

What I am trying to achieve is to find the keywords in a given text.

My current approach is to count how many times each word appears in the text and sort the list so that the most frequently used words are at the top.

The problem is that common words such as "is", "was", and "were" always end up at the top, and they are clearly not useful as keywords.

Can you suggest a better approach, so that it reliably finds relevant keywords?

Upvotes: 2

Views: 5131

Answers (3)

posdef

Reputation: 6532

My first approach to something like this would be mathematical modeling rather than pure programming.

There are two "simple" ways to attack a problem like this: a) an exclusion list (penalize a collection of words that you deem useless); b) a weight function that, for example, builds on word length, so that short words such as prepositions (in, at, ...) and pronouns (I, you, me, his, ...) are penalized and hopefully fall to mid-table.
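As a rough illustration of both ideas combined, here is a minimal Python sketch (the same logic carries over to other languages). The `EXCLUDED` set, the `min_len` threshold, and the linear length penalty are all assumptions made up for this example, not a recommended tuning: excluded words score zero, words shorter than `min_len` are scaled down in proportion to their length, and everything else keeps its raw count.

```python
import re
from collections import Counter

# Hypothetical exclusion list; extend it for your own texts.
EXCLUDED = {"the", "and", "was", "were", "that", "this", "with"}


def length_weight(word, min_len=4):
    """Assumed weight: 0 for excluded words, a reduced weight for short
    words, and 1 for everything else."""
    if word in EXCLUDED:
        return 0.0
    if len(word) < min_len:
        return len(word) / min_len
    return 1.0


def score_words(text):
    """Return (word, score) pairs, highest score first.

    The score is the raw frequency multiplied by the length-based weight.
    Ties are broken alphabetically so the order is deterministic.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    scores = {w: n * length_weight(w) for w, n in counts.items()}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    sample = ("The cat is on the mat. The cat was in the garden, "
              "and the garden is big.")
    for word, score in score_words(sample)[:5]:
        print(word, score)
```

With this weighting, "the" occurs most often in the sample but scores zero, while short words like "is" are pushed down rather than removed outright.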

I am not sure if this is what you were looking for, but I hope it helps. By the way, contextual text processing is a subject of active research, so you may find a number of interesting projects in that area.

Upvotes: 0

Mark Baker

Reputation: 212412

Use something like a Brill tagger (a part-of-speech tagger) to identify the different parts of speech, such as nouns. Then extract only the nouns and sort them by frequency.

Upvotes: 6

GordonM

Reputation: 31730

Well, you could use preg_split to get the list of words and then count how often each one occurs. I'm assuming that's the bit you've got working so far.

The only thing I can think of for stripping the unimportant words is to keep a dictionary of words you want to ignore, containing "a", "I", "the", "and", etc., and use that dictionary to filter out the unwanted words.
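The split, count, and filter steps could look roughly like this. It is a sketch in Python rather than PHP; in PHP the analogous pieces would be preg_split with the PREG_SPLIT_NO_EMPTY flag followed by array_count_values. The `IGNORE` set and the `keyword_counts` name are made up for the example, and a real ignore dictionary would be much longer.

```python
import re
from collections import Counter

# Assumed ignore dictionary for illustration only; a real one would be larger.
IGNORE = {"a", "i", "the", "and", "is", "was", "were", "of", "to", "in"}


def keyword_counts(text, ignore=IGNORE):
    """Count words in `text`, skipping anything in the ignore dictionary."""
    # Split on runs of non-word characters and drop the empty strings that
    # appear when the text starts or ends with punctuation.
    words = [w for w in re.split(r"\W+", text.lower()) if w]
    return Counter(w for w in words if w not in ignore)


if __name__ == "__main__":
    counts = keyword_counts("The parser was fast, and the parser is small.")
    # most_common() sorts by frequency, highest first.
    for word, n in counts.most_common():
        print(word, n)
```

Lowercasing before the lookup means the dictionary only needs lowercase entries, so "The" and "the" are both discarded.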

Why are you doing this? Is it for searching page content? If so, most back-end databases offer some kind of text search functionality. Both MySQL and Postgres have a full-text search engine, for example, that automatically discards the unimportant words. I'd recommend using the full-text features of whichever database you're using, as chances are it already implements something that meets your requirements.

Upvotes: 1
