Reputation: 43
I am planning to do a text analysis in R, similar to sentiment analysis, using my own custom dictionary that follows a "trade" versus "law" logic.
I have all the required words for the dictionary in an Excel file. It looks like this:
> %
> 1 Trade
> 2 Law
> %
> business 1
> exchange 1
> industry 1
> rule 2
> settlement 2
> umpire 2
> court 2
> tribunal 2
> lawsuit 2
> bench 2
> courthouse 2
> courtroom 2
What steps do I need to take to transform this into an R-suitable format and apply it to my text corpus?
Thank you for your help!
Upvotes: 1
Views: 827
Reputation: 23608
Create a data.frame with two columns (the word and its sector) and store it somewhere, for example as an .rds file, a database table, or an Excel sheet, so that you can load it whenever you need it.
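As a rough sketch of that step, the snippet below turns the dictionary from the question into a two-column data.frame using base R only. It assumes the entries have been exported from Excel as plain text, one entry per line, with the category header enclosed between two `%` lines; that layout, the name `dict_lines`, and the temporary file path are assumptions made for illustration. If the words and codes already sit in two Excel columns, a reader such as `readxl::read_excel()` would likely give you the data.frame directly.

```r
# Hypothetical input: the dictionary as plain-text lines (layout is assumed).
dict_lines <- c("%", "1 Trade", "2 Law", "%",
                "business 1", "exchange 1", "industry 1", "rule 2",
                "settlement 2", "umpire 2", "court 2", "tribunal 2",
                "lawsuit 2", "bench 2", "courthouse 2", "courtroom 2")

# The two "%" lines delimit the category header; entries follow the second one.
marks   <- which(dict_lines == "%")
header  <- dict_lines[(marks[1] + 1):(marks[2] - 1)]
entries <- dict_lines[(marks[2] + 1):length(dict_lines)]

# Header lines look like "<code> <label>".
sector_labels <- data.frame(
  sector = as.integer(sub(" .*$", "", header)),
  label  = sub("^[^ ]+ ", "", header),
  stringsAsFactors = FALSE
)

# Entry lines look like "<word> <code>".
my_scoring_df <- data.frame(
  word   = sub(" .*$", "", entries),
  sector = as.integer(sub("^.* ", "", entries)),
  stringsAsFactors = FALSE
)

# Store once, then reload whenever needed.
# tempfile() is used here only so the sketch runs anywhere.
rds_path <- tempfile(fileext = ".rds")
saveRDS(my_scoring_df, rds_path)
my_scoring_df <- readRDS(rds_path)
```

Keeping the labels in a separate small table (`sector_labels`) means the numeric codes can be swapped for readable names later with a join.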
Once the data is in a data.frame, you can use joins or dictionary lookups to match it against the words in your text corpus. In the scoring data.frame I used 1 and 2 to represent the sectors, but you can use words as well.
See the example below using tidytext, but read up on sentiment analysis and use whichever package suits your needs.
library(tidytext)
library(dplyr)

text_df <- data.frame(id = 1:2,
                      text = c("The business is in the mining industry and has a settlement.",
                               "The court ordered the business owner to settle the lawsuit."))

text_df %>%
  unnest_tokens(word, text) %>%
  inner_join(my_scoring_df)
Joining, by = "word"
  id       word sector
1  1   business      1
2  1   industry      1
3  1 settlement      2
4  2      court      2
5  2   business      1
6  2    lawsuit      2
Data:
my_scoring_df <- structure(list(word = c("business", "exchange", "industry", "rule",
    "settlement", "umpire", "court", "tribunal", "lawsuit", "bench",
    "courthouse", "courtroom"), sector = c(1L, 1L, 1L, 2L, 2L, 2L,
    2L, 2L, 2L, 2L, 2L, 2L)), class = "data.frame", row.names = c(NA,
    -12L))
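To get from matched words to something closer to a per-document score, one option is to count the matches per document and sector. The sketch below repeats the toy data so that it runs on its own. The label vector `sector_labels` and the column name `n_words` are names introduced here for illustration, and the mapping 1 = Trade, 2 = Law is taken from the header of the dictionary in the question.

```r
library(tidytext)
library(dplyr)

# Same toy dictionary and texts as above, repeated so the block is self-contained.
my_scoring_df <- data.frame(
  word = c("business", "exchange", "industry", "rule", "settlement", "umpire",
           "court", "tribunal", "lawsuit", "bench", "courthouse", "courtroom"),
  sector = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
  stringsAsFactors = FALSE
)
text_df <- data.frame(
  id = 1:2,
  text = c("The business is in the mining industry and has a settlement.",
           "The court ordered the business owner to settle the lawsuit."),
  stringsAsFactors = FALSE
)

# Readable names for the sector codes (assumed from the question's header).
sector_labels <- c("1" = "Trade", "2" = "Law")

sector_counts <- text_df %>%
  unnest_tokens(word, text) %>%                    # one lowercase word per row
  inner_join(my_scoring_df, by = "word") %>%       # keep dictionary words only
  mutate(sector = unname(sector_labels[as.character(sector)])) %>%
  count(id, sector, name = "n_words")              # matches per document and sector

sector_counts
```

With the toy data, document 1 has two Trade words and one Law word, and document 2 has one Trade word and two Law words. Note that the join matches exact word forms only: "settle" is not counted although "settlement" is in the dictionary, so you may want to add inflected forms or stem the tokens first.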
Upvotes: 1