Reputation: 23
I have some e-mail subject lines with their respective read rates and spam rates. Is it possible to use text mining in R to analyse the subject lines and find the words/phrases that contribute to the read/spam rates? Please advise.
E-mail subject               Spam %   Read %
Hottest Hues for Spring!     3.00%    12.00%
New Styles Just for You      0.00%    17.00%
We've Got the Perfect Fit!   0.00%    19.00%
Save on Dresses and More!    5.00%    20.00%
More Online Deals Inside     2.04%    13.19%
Upvotes: 1
Views: 1866
Reputation: 54287
It's certainly possible. For example:
# Read the example data
df <- read.table(sep=";", header=TRUE, quote="", stringsAsFactors=FALSE, text="
subject;spam;read
Hottest Hues for Spring!;3.00;12.00
New Styles Just for You;0.00;17.00
We've Got the Perfect Fit!;0.00;19.00
Save on Dresses and More!;5.00;20.00
More Online Deals Inside;2.04;13.19")
library(tm)
# Build a corpus and a binary document-term matrix from the subject lines
corp <- Corpus(VectorSource(df$subject))
dtm <- DocumentTermMatrix(corp, control=list(weighting=weightBin))
tmat_ <- as.matrix(dtm)
# Regress spam rate on the word indicators
fit.spam <- lm(df$spam ~ tmat_)
summary(fit.spam)
But it will take considerable effort and know-how to get meaningful results and to interpret them.
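As a side note, the document-term matrix that tm builds can also be sketched in a few lines of base R, which makes it easier to see what the regression actually operates on (a minimal sketch; the lower-casing and punctuation stripping mimic what DocumentTermMatrix does by default):

```r
subjects <- c("Hottest Hues for Spring!", "New Styles Just for You",
              "We've Got the Perfect Fit!", "Save on Dresses and More!",
              "More Online Deals Inside")
# Tokenize: lower-case, strip punctuation, split on whitespace
tokens <- strsplit(gsub("[[:punct:]]", "", tolower(subjects)), "\\s+")
vocab <- sort(unique(unlist(tokens)))
# Binary term matrix: 1 if the word occurs in the subject line
tmat <- t(sapply(tokens, function(tk) as.integer(vocab %in% tk)))
colnames(tmat) <- vocab
tmat
```

Each row is a subject line and each column a word; this is the kind of matrix that ends up on the right-hand side of the regression.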
Update:
If the model fits the data, it will tell you which words yield higher read rates. Here is an artificial example:
set.seed(1)
# True word weights and intercept used to generate artificial read rates
weights <- c("hot"=2, "dresses"=5, "pants"=-3, "only"=1)
ic <- 6L
# Draw 10 random word combinations (a list, since the sizes differ)
idx <- replicate(n=10, sample(1:length(weights), size=sample(1:length(weights), size=1)))
# Build the subjects and their deterministic read rates from the weights
df <- as.data.frame(t(sapply(idx, function(x) {
  cbind(paste(names(weights)[x], collapse=" "), sum(weights[x]) + ic)
})))
names(df) <- c("subject", "read")
df$read <- as.numeric(as.character(df$read))
df
# subject read
# 1 dresses only 12
# 2 hot pants dresses only 11
# 3 hot only pants 6
# 4 dresses pants hot 10
# 5 only dresses pants 9
# 6 hot dresses only pants 11
# 7 hot dresses 13
# 8 dresses only pants hot 11
# 9 only 7
# 10 only hot dresses 14
library(tm)
# Term-frequency document-term matrix of the artificial subjects
corp <- Corpus(VectorSource(df$subject))
dtm <- DocumentTermMatrix(corp, control=list(weighting=weightTf))
tmat_ <- as.matrix(dtm)
# The regression recovers the true word weights exactly
fit.read <- lm(df$read ~ tmat_)
summary(fit.read)
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 6.000e+00 4.802e-16 1.249e+16 <2e-16 ***
# tmat_dresses 5.000e+00 3.199e-16 1.563e+16 <2e-16 ***
# tmat_hot 2.000e+00 3.116e-16 6.418e+15 <2e-16 ***
# tmat_only 1.000e+00 3.409e-16 2.934e+15 <2e-16 ***
# tmat_pants -3.000e+00 2.762e-16 -1.086e+16 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 3.862e-16 on 5 degrees of freedom
# Multiple R-squared: 1, Adjusted R-squared: 1
# F-statistic: 9.455e+31 on 4 and 5 DF, p-value: < 2.2e-16
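Once a model like fit.read is in hand, ranking the words comes down to sorting its coefficients. A self-contained toy version (the term matrix and read rates here are hand-built to be consistent with the weights and intercept used above, not drawn from the simulation):

```r
# Rows = subjects, columns = word indicators
tmat_ <- cbind(dresses = c(1, 1, 0, 1, 0),
               hot     = c(0, 1, 1, 0, 1),
               only    = c(1, 1, 0, 0, 0),
               pants   = c(0, 0, 1, 1, 0))
# Read rates generated as intercept 6 plus the word weights (5, 2, 1, -3)
read <- c(12, 14, 5, 8, 8)
fit <- lm(read ~ tmat_)
# Drop the intercept and rank words by their estimated contribution
sort(coef(fit)[-1], decreasing = TRUE)  # dresses (5) highest, pants (-3) lowest
```

With exactly as many observations as parameters and a full-rank design, the fit recovers the generating weights exactly.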
Upvotes: 1
Reputation: 28274
+1 for matt_k's answer; the CRAN task views are basically the standard way to go if you're exploring the field. So this is rather an addition to his links.
Plus, you might want to look at the page of Mark van der Loo, who works in the field and provides some examples of approximate string matching. He is the author of the stringdist package.
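To illustrate why approximate string matching can matter here: near-duplicate words ("dress", "dresses") would otherwise become separate columns in the term matrix. Base R's adist() computes the Levenshtein edit distance, which the stringdist package generalizes to many more metrics (this is a sketch with made-up words, not part of the original data):

```r
# Pairwise edit distances between candidate word variants
words <- c("dress", "dresses", "deal", "deals", "style")
d <- adist(words)
dimnames(d) <- list(words, words)
d["dress", "dresses"]  # 2: two insertions apart, a candidate pair to merge
```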
Upvotes: 1
Reputation: 4489
You can use the tm package (http://cran.r-project.org/web/packages/tm/index.html). A good place to start is also the task views on CRAN; in this case, take a look at the Natural Language Processing task view (http://cran.r-project.org/web/views/NaturalLanguageProcessing.html).
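To make the pointer concrete, here is a minimal tm session on the subject lines from the question (a sketch; removePunctuation is passed so that "More!" and "More" count as the same term):

```r
library(tm)
subjects <- c("Hottest Hues for Spring!", "New Styles Just for You",
              "We've Got the Perfect Fit!", "Save on Dresses and More!",
              "More Online Deals Inside")
corp <- Corpus(VectorSource(subjects))
dtm <- DocumentTermMatrix(corp, control = list(removePunctuation = TRUE))
# Terms occurring in at least two subject lines (e.g. "more")
findFreqTerms(dtm, lowfreq = 2)
```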
Upvotes: 1