Reputation: 811
I have a set of Twitter status updates and I'm trying to filter out all direct messages, along with their senders and receivers. My data frame includes columns for senders and text. Using regular expressions, I'm trying to extract the receivers from the text column.
This is what I have so far, but it's returning some strange results:
WD <- getwd()
if (!is.null(WD)) setwd(WD)
load("data.R")
#http://www.unet.univie.ac.at/~a0406222/data.R
dmtext <- grep("^@[a-z0-9_]{1,15}", tweets$text, perl = TRUE, value = TRUE, ignore.case = TRUE)
dm.receiver <- gsub("^@([a-z0-9_]{1,15})[ :,].*$", "\\1", dmtext, perl = TRUE, ignore.case = TRUE)
dm.sender <- as.character(tweets$from_user[grep("^@[a-z0-9_]{1,15}", tweets$text, perl = TRUE, ignore.case = TRUE, value = FALSE)])
dm.df <- data.frame(dm.sender,dm.receiver,dmtext)
dm.df[1:1000,2]
These are some examples of the bad results I get for dm.receiver:
@insultaofuturo Apesar da proibição, jovens insistem em acampar no Aterro na Rio+20\nhttp://t.co/dCfFHUWV
@mqtodd Bringing the .green Internet to Rio+20 Summit | DotGreen\nhttp://t.co/pQqYilXp #RioPlus20 #gogreen
@Shyman33 Elinor Ostrom's trailblazing commons research can inspire Rio+20\n http://t.co/m7OTHBtP
@OccupyRio20 @pnud_es @FBuenAbad @rioplussocial #Futurewewant \nALGO DE ESTO SE HA CUMPLIDO? http://t.co/QDlVwT5z
@UNDP_CDG#UNDP#Asia-Pacific#Rio+20E-discussion on National&Local Planning for Sustainable Development. Contribute&mail:[email protected]
Why do I get results longer than 15 characters when using {1,15}?
Upvotes: 1
Views: 242
Reputation: 811
It turned out to be an encoding problem. I was not able to solve this issue using regular expressions, but the software I used to retrieve the tweets has a column that indicates the user ID each tweet is addressed to, so I'll use that for the analysis.
Upvotes: 1
Reputation: 6566
Your grep command matches anything that starts with an @ followed by 1-15 characters from the class [a-z0-9_]. For example:
@blahblahblahblahblahblahblahblahblah
will match because grep anchors at the start of the string, finds the @, consumes up to 15 allowed characters, and then stops, considering this a successful match. grep doesn't care what comes after your pattern in the string, as long as it found something that matches.
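A minimal sketch of that behaviour, using the made-up handle from above (the variable names are only illustrative and are not taken from the question's data):

```r
# Hypothetical handle, well over Twitter's 15-character limit
long_name <- "@blahblahblahblahblahblahblahblahblah"
pattern <- "^@[a-z0-9_]{1,15}"

# grepl() succeeds as soon as the pattern matches a prefix of the string,
# so the overlong handle still counts as a hit
hit <- grepl(pattern, long_name, perl = TRUE, ignore.case = TRUE)

# regmatches() returns only the part that the pattern actually consumed:
# the "@" plus the first 15 allowed characters
matched <- regmatches(long_name,
                      regexpr(pattern, long_name, perl = TRUE, ignore.case = TRUE))
n_matched <- nchar(matched)
```

Here hit is TRUE even though the handle is too long, and the matched part is just the first 16 characters of the string.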
In order to get only things under 15 characters, you also have to specify what comes next:
dmtext <- grep("^@[a-z0-9_]{1,15}\\b", ...
This matches 1-15 characters followed by a word boundary (\b, with an extra backslash for string escaping). Thus, it won't match a handle that is 16 or 100 characters long, only one with between 1 and 15 characters.
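As a rough sketch of how the word-boundary pattern could be used for both filtering and extraction: the sample texts below are shortened versions of two tweets from the question plus one hypothetical overlong handle, and the names texts, is_dm, handles and receivers are assumptions for illustration. It uses regmatches() rather than gsub(), because sub() and gsub() return the input string unchanged whenever the pattern does not match it, which can leave whole tweets in the result.

```r
# Two shortened examples from the question and one hypothetical overlong handle
texts <- c("@mqtodd Bringing the .green Internet to Rio+20",
           "@blahblahblahblahblahblahblahblahblah hello",
           "@UNDP_CDG#UNDP#Asia-Pacific")

# Same character class as before, now followed by a word boundary
pattern <- "^@[a-z0-9_]{1,15}\\b"

# The overlong handle no longer matches: there is no word boundary
# after any of its first 15 characters
is_dm <- grepl(pattern, texts, perl = TRUE, ignore.case = TRUE)

# Keep only the matched handle for each matching text, then drop the "@"
handles <- regmatches(texts,
                      regexpr(pattern, texts, perl = TRUE, ignore.case = TRUE))
receivers <- sub("^@", "", handles)
```

Note that regmatches() drops the non-matching elements, so receivers lines up with texts[is_dm] rather than with texts itself.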
Upvotes: 0