user3151086

Reputation: 23

Text Analysis using R

I have some e-mail subject lines with their respective read rates and spam rates. Is it possible to use text mining in R to analyse the subject lines and identify the words/phrases that contribute to the read/spam rates? Please advise.

E-mail subject               Spam %   Read %
Hottest Hues for Spring!      3.00%   12.00%
New Styles Just for You       0.00%   17.00%
We've Got the Perfect Fit!    0.00%   19.00%
Save on Dresses and More!     5.00%   20.00%
More Online Deals Inside      2.04%   13.19%

Upvotes: 1

Views: 1866

Answers (3)

lukeA

Reputation: 54287

It's certainly possible. For example:

df <- read.table(sep=";", header=T, quote="", text="
subject;spam;read
Hottest Hues for Spring!;3.00;12.00
New Styles Just for You;0.00;17.00
We've Got the Perfect Fit!;0.00;19.00
Save on Dresses and More!;5.00;20.00
More Online Deals Inside;2.04;13.19")
library(tm)
corp <- Corpus(VectorSource(df$subject))
dtm <- DocumentTermMatrix(corp, control=list(weighting=weightBin))
tmat_ <- as.matrix(dtm)
fit.spam <- lm(df$spam ~ tmat_)
summary(fit.spam)

But it will take considerable effort and know-how to get meaningful results and to interpret them.
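To see why it is hard with so little data: the five subjects above already yield about 20 distinct terms, so the design matrix has far more columns than rows and the fit is rank-deficient. Here is a base-R sketch of the same idea (no tm required; the tokenisation — lowercasing, stripping punctuation, splitting on whitespace — is my own simplification, and the data are taken from the question):

```r
# Minimal base-R sketch: build a binary document-term matrix by hand
# and regress the spam rate on it.
subjects <- c("Hottest Hues for Spring!",
              "New Styles Just for You",
              "We've Got the Perfect Fit!",
              "Save on Dresses and More!",
              "More Online Deals Inside")
spam <- c(3.00, 0.00, 0.00, 5.00, 2.04)

# Tokenise: lowercase, strip punctuation, split on whitespace
tokens <- strsplit(tolower(gsub("[[:punct:]]", "", subjects)), "\\s+")
terms  <- sort(unique(unlist(tokens)))
tmat   <- t(sapply(tokens, function(x) as.integer(terms %in% x)))
colnames(tmat) <- terms

fit <- lm(spam ~ tmat)
coef(fit)  # with 5 subjects and ~20 terms, most coefficients come back NA
```

With only five observations almost every coefficient is inestimable (NA); you would need many more subject lines before the per-word estimates mean anything.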

Update:

If the model fits the data, it will tell you which words yield higher read rates. Here is an artificial example:

set.seed(1)
weights <- c("hot"=2, "dresses"=5, "pants"=-3, "only"=1)  # true per-word effects
ic <- 6L                                                  # intercept (baseline read rate)
# draw 10 random subject lines from the word list
idx <- replicate(n=10, sample(1:length(weights), size=sample(1:length(weights), size=1)))
df <- as.data.frame(t(sapply(idx, function(x) {
  cbind(paste(names(weights)[x], collapse=" "), sum(weights[x]) + ic)
})))
names(df) <- c("subject", "read")
df$read <- as.numeric(as.character(df$read))
df
#                   subject read
# 1            dresses only   12
# 2  hot pants dresses only   11
# 3          hot only pants    6
# 4       dresses pants hot   10
# 5      only dresses pants    9
# 6  hot dresses only pants   11
# 7             hot dresses   13
# 8  dresses only pants hot   11
# 9                    only    7
# 10       only hot dresses   14
library(tm)
corp <- Corpus(VectorSource(df$subject))
dtm <- DocumentTermMatrix(corp, control=list(weighting=weightTf))
tmat_ <- as.matrix(dtm)
fit.read <- lm(df$read ~ tmat_)
summary(fit.read)
#                  Estimate Std. Error    t value Pr(>|t|)    
#   (Intercept)   6.000e+00  4.802e-16  1.249e+16   <2e-16 ***
#   tmat_dresses  5.000e+00  3.199e-16  1.563e+16   <2e-16 ***
#   tmat_hot      2.000e+00  3.116e-16  6.418e+15   <2e-16 ***
#   tmat_only     1.000e+00  3.409e-16  2.934e+15   <2e-16 ***
#   tmat_pants   -3.000e+00  2.762e-16 -1.086e+16   <2e-16 ***
#   ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 3.862e-16 on 5 degrees of freedom
# Multiple R-squared:      1,  Adjusted R-squared:      1 
# F-statistic: 9.455e+31 on 4 and 5 DF,  p-value: < 2.2e-16

Upvotes: 1

Matt Bannert

Reputation: 28274

+1 for matt_k's answer: the CRAN task views are basically the standard way to go if you're exploring the field, so this is rather an addition to his links.

Plus, you might want to look at Mark van der Loo's page; he works in the field and provides some examples of approximate string matching. He's the author of the stringdist package.
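As a quick taste of that package (assuming stringdist is installed; the misspelled query and the subject lines from the question are just illustrative):

```r
# Approximate string matching with the stringdist package: compute
# Levenshtein distances from a misspelled query to each subject line.
library(stringdist)

subjects <- c("Hottest Hues for Spring!",
              "New Styles Just for You",
              "More Online Deals Inside")
d <- stringdist("hotest hues for spring", tolower(subjects), method = "lv")
subjects[which.min(d)]  # the subject line closest to the query
```

This kind of fuzzy matching is useful for deduplicating or clustering near-identical subject lines before any modelling step.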

Upvotes: 1

matt_k

Reputation: 4489

You can use the tm package (http://cran.r-project.org/web/packages/tm/index.html). Also, a good place to start is the task views on CRAN; in this case, take a look at the Natural Language Processing task view (http://cran.r-project.org/web/views/NaturalLanguageProcessing.html).

Upvotes: 1
