lmcshane

Reputation: 1114

Extract location data using regex in R

R newbie here. I have data that looks something like this:

{'id': 19847005, 'profile_sidebar_fill_color': u'http://pbs.foo.com/profile_background', 'profile_text_color': u'333333', 'followers_count': 1105, 'location': u'San Diego, CA', 'profile_background_color': u'9AE4E8', 'listed_count': 43, '009', 'time_zone': u'Pacific Time (US & Canada)', 'protected': False}

I want to extract the location data from this text: San Diego, CA.

I have been trying to use the stringr package to accomplish this, but can't quite get the regex right to capture the city and state. Sometimes the state will be present, other times not.

location_pattern <- "'location':\su'(\w+)'"
rawdata$location <- str_extract(rawdata$user, location_pattern)

Upvotes: 0

Views: 279

Answers (3)

Rich Scriven

Reputation: 99361

It looks like a JSON string, but if you're not too concerned about that, then perhaps this will help.

library(stringi)

# x holds the question's string
ss <- stri_split_regex(x, "[{}]|u?'|(, '(009')?)|: ", omit_empty = TRUE)[[1]]
(m <- matrix(ss, ncol = 2, byrow = TRUE))
#       [,1]                         [,2]
#  [1,] "id"                         "19847005"
#  [2,] "profile_sidebar_fill_color" "http://pbs.foo.com/profile_background"
#  [3,] "profile_text_color"         "333333"
#  [4,] "followers_count"            "1105"
#  [5,] "location"                   "San Diego, CA"
#  [6,] "profile_background_color"   "9AE4E8"
#  [7,] "listed_count"               "43"
#  [8,] "time_zone"                  "Pacific Time (US & Canada)"
#  [9,] "protected"                  "False"

So now you have the ID names in the left column and the values on the right. It would probably be simple to reassemble the json from this point if need be.

Also, regarding the JSON-ness, we can coerce m to a data.frame (or leave it as a matrix) and then use jsonlite::toJSON:

library(jsonlite)
json <- toJSON(setNames(as.data.frame(m), c("ID", "Value")))
fromJSON(json)
#                           ID                                 Value
# 1                         id                              19847005
# 2 profile_sidebar_fill_color http://pbs.foo.com/profile_background
# 3         profile_text_color                                333333
# 4            followers_count                                  1105
# 5                   location                         San Diego, CA
# 6   profile_background_color                                9AE4E8
# 7               listed_count                                    43
# 8                  time_zone            Pacific Time (US & Canada)
# 9                  protected                                 False

Upvotes: 2

Greg Snow

Reputation: 49650

Others have given possible solutions, but not explained what likely went wrong with your attempt.

The str_extract function uses POSIX extended regular expressions, which do not understand \w and \s; those shortcuts are specific to Perl regular expressions. You can wrap your pattern in the perl() function from the stringr package and it will then recognize the shortcuts, or you can use [[:space:]] in place of \s and [[:alnum:]_] in place of \w, though more likely you will want something like [[:alpha:], ] or [^'].

Also, R's string parser gets a shot at the string before it is passed to the matching function, so you will need \\s and \\w if you use the perl() function (or any other regular-expression function in R). The first \ escapes the second so that a single \ remains in the string to be interpreted as part of the regular expression.
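To make the escaping point concrete, here is a minimal base-R sketch; the string x below is an assumed, trimmed-down version of the question's data, not the real input:

```r
# Assumed, trimmed sample of the question's string.
x <- "{'id': 19847005, 'location': u'San Diego, CA', 'protected': False}"

# Perl shortcuts need double escaping: each \\ reaches the regex engine as one \.
m1 <- regmatches(x, regexpr("'location':\\s*u'[^']+'", x, perl = TRUE))

# POSIX-class equivalent: [[:space:]] stands in for \s.
m2 <- regmatches(x, regexpr("'location':[[:space:]]*u'[^']+'", x))

m1
# [1] "'location': u'San Diego, CA'"
```

Both patterns match the same span; the only difference is whether the whitespace shorthand is spelled in Perl or POSIX notation.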

Upvotes: 2

akrun

Reputation: 887601

You could try

str_extract_all(str1, perl("(?<=location.: u.)[^']+(?=')"))[[1]]
#[1] "San Diego, CA"
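As an aside for readers on newer stringr versions, which I believe have since dropped perl(): the same look-around extraction works in base R with perl = TRUE (str1 below is an assumed sample, not the question's full string):

```r
# Base-R equivalent of the look-around extraction; perl = TRUE enables the
# (fixed-width) lookbehind. str1 is an assumed sample string.
str1 <- "{'id': 19847005, 'location': u'San Diego, CA', 'protected': False}"

res <- regmatches(str1, regexpr("(?<=location.: u.)[^']+(?=')", str1, perl = TRUE))
res
# [1] "San Diego, CA"
```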

Upvotes: 2
