Reputation: 65
I have a one column like this:
x <- c('WV West Virginia','FL Florida','CA California','SC South Carolina')
# [1] WV West Virginia FL Florida
# [3] CA California SC South Carolina
How can I separate the abbreviation from the whole state name. And I want to give the two new columns two different headers. I think I can only solve this by separating the all upper letter words away.
Upvotes: 1
Views: 392
Reputation: 109874
Here's a data.table/ gsub
approach:
x <- c('WV West Virginia','FL Florida','CA California','SC South Carolina')
data.table::data.table(x)[,
abb := gsub("(^[A-Z]{2})( .+)", "\\1", x)][,
state := gsub("(^[A-Z]{2})( .+)", "\\2", x)][]
## x abb state
## 1: WV West Virginia WV West Virginia
## 2: FL Florida FL Florida
## 3: CA California CA California
## 4: SC South Carolina SC South Carolina
Upvotes: 0
Reputation: 887148
Based on @rawr's comment, we could split
'x' at white space that follows the first two characters, i.e. showed by the regex lookaround ((?<=^.{2})
). The output will be a list
, which we rbind
, convert to data.frame
and then cbind
with the original vector 'x'.
cbind(x, as.data.frame(do.call(rbind,strsplit(x, '(?<=^.{2})\\s+', perl=TRUE)),
stringsAsFactors=FALSE))
# x V1 V2
#1 WV West Virginia WV West Virginia
#2 FL Florida FL Florida
#3 CA California CA California
#4 SC South Carolina SC South Carolina
Or instead of the regex lookaround, we could use stri_split
with n=2
and split at whitespace.
library(stringi)
cbind(x,as.data.frame(do.call(rbind,stri_split(x, regex='\\s+', n=2))))
Upvotes: 2
Reputation: 28441
With tidyr
we can use separate
to expand the column into two while specifying the new names. The argument extra=merge
limits the output to the given columns. The separator will default to non-alpha-numerics:
library(tidyr)
separate(df, x, c("Abb", "State"), extra="merge")
# Abb State
#1 WV West Virginia
#2 FL Florida
#3 CA California
#4 SC South Carolina
Data
x = c('WV West Virginia', 'FL Florida','CA California', 'SC South Carolina')
Upvotes: 4
Reputation: 66819
Use the state.*
constants that come with the base datasets package
DF = data.frame(raw=c("WV West Virginia","FL Florida","CA California","SC South Carolina"))
DF$state.abbr <- substr(DF$raw, 1, 2)
DF$state.name <- state.name[ match(DF$state.abbr, state.abb) ]
# raw state.abbr state.name
# 1 WV West Virginia WV West Virginia
# 2 FL Florida FL Florida
# 3 CA California CA California
# 4 SC South Carolina SC South Carolina
This way, you can afford to have typos or other oddities in the state names.
Upvotes: 3
Reputation: 13139
Two approaches without external packages:
Approach 1: you could use substring
in combination with nchar
.
dat <-data.frame(raw=c("WV West Virginia","FL Florida", "CA California","SC South Carolina"),
stringsAsFactors=F)
dat$code <- substr(dat$raw,1,2)
dat$state <- substr(dat$raw, 4, nchar(dat$raw))
> dat
raw code state
1 WV West Virginia WV West Virginia
2 FL Florida FL Florida
3 CA California CA California
4 SC South Carolina SC South Carolina
Approach two: you could use regular expressions to replace parts of your strings:
##approach two: regex
dat$code <- sub(" .+","",dat$raw)
dat$state <- sub("[A-Z]{2} ","",dat$raw)
Upvotes: 3
Reputation: 1545
Use the reshape2 package.
library(reshape2)
x <- rbind('WV West Virginia','FL Florida','CA California','SC South Carolina')
colsplit(x," ",c("Code","State"))
Output:
Code State
1 WV West Virginia
2 FL Florida
3 CA California
4 SC South Carolina
Upvotes: 2