Reputation: 970
I need to extract domain-specific terms (for example, political terms) from a large training corpus. How can I use Weka and its filters to achieve this? Can I use the feature vector produced by the StringToWordVector filter in Weka to do this?
Upvotes: 0
Views: 236
Reputation: 1061
You can, at least partly, provided you have an appropriate dataset. For instance, assume you have a dataset like this one:
@relation test
@attribute text String
@attribute politics {yes,no}
@attribute religion {yes,no}
@data
"this is a text about politics",yes,no
"this text is about religion",no,yes
"this text mixes everything",yes,yes
To get terms about politics, for instance, you can:
1. Apply the StringToWordVector filter to the text attribute to get terms.
2. Apply the AttributeSelection filter with Ranker and InfoGainAttributeEval, using politics as the class attribute, to get the top-ranked terms.
This latter step will give you a list of the terms that are most predictive for the politics category. Most of them will be terms in the politics domain, although some terms may be predictive precisely because they are not in the politics domain (that is, they provide negative evidence).
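To make the ranking step more concrete, here is a minimal sketch in plain Python (not Weka) of the kind of score an information-gain ranking is based on. It assumes binary "term occurs in document" features and whitespace tokenization, which is a simplification of what StringToWordVector produces; the names DOCS, entropy, info_gain and rank_terms are made up for this illustration and are not part of any Weka API.

```python
import math
from collections import Counter

# Toy documents mirroring the ARFF example: (text, politics label).
DOCS = [
    ("this is a text about politics", "yes"),
    ("this text is about religion", "no"),
    ("this text mixes everything", "yes"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def info_gain(term, docs):
    """Information gain of the binary feature 'term occurs in document'."""
    labels = [label for _, label in docs]
    present = [label for text, label in docs if term in text.split()]
    absent = [label for text, label in docs if term not in text.split()]
    n = len(docs)
    # Class entropy remaining after splitting on presence/absence of the term.
    conditional = ((len(present) / n) * entropy(present)
                   + (len(absent) / n) * entropy(absent))
    return entropy(labels) - conditional


def rank_terms(docs):
    """Return (term, gain) pairs sorted by decreasing information gain."""
    vocabulary = sorted({word for text, _ in docs for word in text.split()})
    scored = [(term, info_gain(term, docs)) for term in vocabulary]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for term, gain in rank_terms(DOCS):
        print(f"{term}\t{gain:.3f}")
```

On this tiny dataset the top-ranked term for the politics class is "religion", because its presence perfectly predicts politics=no. That is exactly the negative-evidence effect mentioned above, so the ranked list still needs a manual check or a filter on which class the term co-occurs with.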
The quality of the terms you get depends on the dataset. The more topics it deals with, the better for your results; so instead of having two classes (politics and religion, as in my dataset), it is much better to have many of them, with many examples for each category.
Upvotes: 1