Reputation: 325
I have a column in my data set called "Market.Pair" that contains information about departure and destination points of some flights. For example:
input <- data.frame(Market.Pair = c("US to/from CA", "HOU to/from DFW/DAL", "EWR/JFK to/from LAX/SFO", "US-NYC to/from FR-PAR", "US to/from Asia"))
input
All two letter words represent countries(ex. US, CA). All three letter words(or multiple three letter words separated by "/") represent airports(ex. HOU, DFW/DAL). All words in the form of XX-XXX represent cities (ex. US-NYC). Other words represent regions, such as Asia, or Europe.
I would like to split this column into multiple columns:
output<- data.frame(Air.1 = c("HOU", "EWR/JFK", "", "", ""), Air.2 = c("DFW/DAL", "LAX/SFO", "", "", ""), City.1 = c("","","US-NYC", "", ""), City.2 = c("","","FR-PAR", "", ""), Country.1 = c("","","","US", "US"), Coutry.2 = c("","","","CA", ""), Region.1 = c("","","", "", "Asia"), Region.2 = c("","","", "", ""))
output
I am new to regex, so any help will be much appreciated!
Upvotes: 3
Views: 140
Reputation: 193687
Here's a fairly manual approach, but it should still be quite efficient. It uses cSplit
from my "splitstackshape" package to split the columns, then it subsets by condition using "data.table" to create new values by reference. Finally, it uses dcast
(again from "data.table") to go into the wide format.
Here's some new sample data with the conditions you describe in your comment.
input <- data.frame(
Market.Pair = c(
"US to/from CA", "HOU to/from DFW/DAL", # Your sample data
"EWR/JFK to/from LAX/SFO",
"US-NYC to/from FR-PAR", "US to/from Asia",
"Latin America/Mexico to EMEA/India", # Some only "to", exception to "/"
"EWR to HKG/NRT, JFK to HKG")) # Some > 1 pair of values per row
Here's a possible approach:
library(splitstackshape)
## First, take care of data combined in single rows
x <- cSplit(input, "Market.Pair", ",", "long")
## Add indicator for row names
x[, rn := 1:nrow(x)]
## Split on to/from or to
x <- cSplit(x, "Market.Pair", " to/from | to ", "long", fixed = FALSE,
stripWhite = FALSE, type.convert = FALSE)
## Add a column named "type" filled with 'Region' as the value
x[, type := "Region"]
## Using your defined conditions, you can replace the values in the
## 'type' column by reference. Here's 'Air'...
x[nchar(Market.Pair) == 3 | grepl("^.../...$", Market.Pair), type := "Air"]
## ... here's 'Country'
x[nchar(Market.Pair) == 2, type := "Country"]
## ... and here's 'City'
x[grepl("^..-...$", Market.Pair), type := "City"]
## Add an indicator variable...
x[, ind := sequence(.N), by = .(rn, type)]
Now, you can reshape the data to a wide format using dcast
from "data.table"
dcast(x, rn ~ type + ind, value.var = "Market.Pair", fill = "")
# rn Air_1 Air_2 City_1 City_2 Country_1 Country_2 Region_1 Region_2
# 1: 1 US CA
# 2: 2 HOU DFW/DAL
# 3: 3 EWR/JFK LAX/SFO
# 4: 4 US-NYC FR-PAR
# 5: 5 US Asia
# 6: 6 Latin America/Mexico EMEA/India
# 7: 7 EWR HKG/NRT
# 8: 8 JFK HKG
Upvotes: 4
Reputation: 20811
input <- data.frame(Market.Pair = c("US to/from CA", "HOU to/from DFW/DAL",
"EWR/JFK to/from LAX/SFO", "US-NYC to/from FR-PAR",
"US to EMEA/India"))
sp <- strsplit(as.character(input$Market.Pair), '\\s+to(/from)?\\s+')
f <- Vectorize(function(x)
if (grepl('\\-', x)) 'City' else if (nchar(x) == 2) 'Country' else
if (grepl('^[A-Z]+/[A-Z]+$|^[A-Z]+$', x)) 'Air' else 'Region')
dd <- lapply(sp, function(x) {
## set up output matrix
cn <- sort(levels(interaction(c('Air','City','Country','Region'), 1:2)))
m <- matrix('', 1, length(cn), dimnames = list(NULL, cn))
## use f above and add the suffix
xx <- f(x)
nn <- setNames(x, paste(xx, ave(xx, xx, FUN = seq_along), sep = '.'))
## match
m[, names(nn)] <- nn
m
})
do.call('rbind.data.frame', dd)
# Air.1 Air.2 City.1 City.2 Country.1 Country.2 Region.1 Region.2
# 1 US CA
# 2 HOU DFW/DAL
# 3 EWR/JFK LAX/SFO
# 4 US-NYC FR-PAR
# 5 US EMEA/India
Upvotes: 3