Magnus

Reputation: 760

Can I create a vector with regexps?

My data looks something like this:

412 U CA, Riverside
413 U British Columbia
414 CREI
415 U Pompeu Fabra
416 Office of the Comptroller of the Currency, US Department of the Treasury
417 Bureau of Economics, US Federal Trade Commission
418 U Carlos III de Madrid
419 U Brescia
420 LUISS Guido Carli
421 U Alicante
422 Harvard Society of Fellows
423 Toulouse School of Economics
424 Decision Economics Inc, Boston, MA
425 ECARES, Free U Brussels

I need to geocode this data to get the coordinates of each institution. To do that, I need all state names to be spelled out. At the same time, I don't want acronyms like "ECARES" to be transformed into "ECaliforniaRES".

I have been toying with the idea of converting the state.abb and state.name vectors into vectors of regular expressions, so that state.abb would look something like this (using Alabama and California as state 1 and state 2):

c("^AL "|" AL "|" AL,"|",AL "| " AL$", "^CA "[....])

And the state.name vector something like this:

c("^Alabama "|" Alabama "|" Alabama,"|",Alabama "| " Alabama$", "^California "[....])

Hopefully, I can then use the mgsub function to replace all expressions in the modified state.abb vector with the corresponding entries in the modified state.name vector.

For some reason, however, it doesn't seem to be possible to put regexps in a vector:

> test<-c(^AL, ^AB)
Error: unexpected '^' in "test<-c(^"

I have tried escaping the "^" signs, but this doesn't seem to work either:

> test<-c(\^AL, \^AB)
Error: unexpected input in "test<-c(\"
> test<-c(\\^AL, \\^AB)

Is there any way of putting regexps in a vector, or is there another way of achieving my goal (that is, to replace all two-letter state abbreviations with state names without messing up other acronyms in the process)?

Excerpt of my data:

c("U Lausanne", "Swiss Finance Institute", "U CA, Riverside", 
"U British Columbia", "CREI", "U Pompeu Fabra", "Office of the Comptroller of the Currency, US Department of the Treasury", 
"Bureau of Economics, US Federal Trade Commission", "U Carlos III de Madrid", 
"U Brescia", "LUISS Guido Carli", "U Alicante", "Harvard Society of Fellows", 
"Toulouse School of Economics", "Decision Economics Inc, Boston, MA", 
"ECARES, Free U Brussels", "Baylor U", "Research Centre for Education", 
"the Labour Market, Maastricht U", "U Bonn", "Swarthmore College"
)

Upvotes: 2

Views: 93

Answers (1)

akrun

Reputation: 887048

We can take the state.abb vector and paste its elements into a single pattern, collapsing with |:

pat1 <- paste0("\\b(", paste(state.abb, collapse="|"), ")\\b")

The \\b marks a word boundary, so that spurious matches inside longer words (e.g., the "AL" in "SAL") are avoided.

Similarly, for state.name, paste ^ and $ as prefix and suffix to mark the start and end of the string, respectively:

pat2 <- paste0("^(", paste(state.name, collapse="|"), ")$")
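
The patterns above only detect matches. To actually swap each abbreviation for its full name, one possible approach is sketched below. It is not part of the original answer: it builds one word-bounded pattern per abbreviation and loops over them with gsub. The names x, patterns and out are illustrative, and the sketch assumes that every standalone two-letter uppercase token matching a state abbreviation really is a state.

```r
# Example institutions taken from the question's data excerpt
x <- c("U CA, Riverside",
       "Decision Economics Inc, Boston, MA",
       "ECARES, Free U Brussels")

# Regexps are ordinary character strings, so a "vector of regexps" is just
# a character vector: one word-bounded pattern per state abbreviation
patterns <- paste0("\\b", state.abb, "\\b")

# state.abb and state.name are parallel built-in vectors, so the pattern at
# index i is replaced by the state name at index i
out <- x
for (i in seq_along(patterns)) {
  out <- gsub(patterns[i], state.name[i], out, perl = TRUE)
}

print(out)
```

Because of the word boundaries, "CA" and "MA" are expanded while the "CA" inside "ECARES" is left alone. Matching is case-sensitive, so "Inc" is not mistaken for "IN".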

Upvotes: 5
