Reputation: 11
When reading on methods of textual analysis, some eliminate documents with "10% lowest density score", that is, documents that are relatively long compared to the occurrence of a certain keyword. How can I achieve a similar result in quanteda?
I've created a corpus using a query of the words "refugee" and "asylum seeker". Now I would like to remove all documents where the count frequency of refugee|asylum_seeker is below 3. However, I imagine it is also possible to use the relative frequency if document length is to be taken into account.
Could someone help me? The solution in my head looks like this, however I don't know how to implement it.
For count frequency: Add counts of occurrences of refugee|asylum_seeker per document and remove documents with an added count below 3.
For relative frequency: Inspect the overall average relative frequency of both words refugee and asylum_seeker, to then calculate the per row relative frequencies of the features and apply a function to remove all documents with a relative frequency of both features below X.
Upvotes: 0
Views: 67
Reputation: 14902
Create a dfm from your tokenised corpus, using dfmat <- dfm(your tokens)
.
Remove the documents features this way:
dfm_remove(dfmat,
as.logical(dfmat[, c("refugee")] < 3 |
dfmat[, c("asylum_seeker")] < 3)
)
Upvotes: 0