JJ.Y
JJ.Y

Reputation: 325

regex -- split one column into multiple columns with no explicit delimiters in R

I have a column in my data set called "Market.Pair" that contains information about departure and destination points of some flights. For example:

input <- data.frame(Market.Pair = c("US to/from CA", "HOU to/from DFW/DAL", "EWR/JFK to/from LAX/SFO", "US-NYC to/from FR-PAR", "US to/from Asia"))
input

All two letter words represent countries(ex. US, CA). All three letter words(or multiple three letter words separated by "/") represent airports(ex. HOU, DFW/DAL). All words in the form of XX-XXX represent cities (ex. US-NYC). Other words represent regions, such as Asia, or Europe.

I would like to split this column into multiple columns:

output<- data.frame(Air.1 = c("HOU", "EWR/JFK", "", "", ""), Air.2 = c("DFW/DAL", "LAX/SFO", "", "", ""), City.1 = c("","","US-NYC", "", ""), City.2 = c("","","FR-PAR", "", ""), Country.1 = c("","","","US", "US"), Coutry.2 = c("","","","CA", ""), Region.1 = c("","","", "", "Asia"), Region.2 = c("","","", "", ""))
output

I am new to regex, so any help will be much appreciated!

Upvotes: 3

Views: 140

Answers (2)

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193687

Here's a fairly manual approach, but it should still be quite efficient. It uses cSplit from my "splitstackshape" package to split the columns, then it subsets by condition using "data.table" to create new values by reference. Finally, it uses dcast (again from "data.table") to go into the wide format.

Here's some new sample data with the conditions you describe in your comment.

input <- data.frame(
  Market.Pair = c(
    "US to/from CA", "HOU to/from DFW/DAL",            # Your sample data
    "EWR/JFK to/from LAX/SFO", 
    "US-NYC to/from FR-PAR", "US to/from Asia", 
    "Latin America/Mexico to EMEA/India",              # Some only "to", exception to "/"
    "EWR to HKG/NRT, JFK to HKG"))                     # Some > 1 pair of values per row

Here's a possible approach:

library(splitstackshape)
## First, take care of data combined in single rows
x <- cSplit(input, "Market.Pair", ",", "long")

## Add indicator for row names
x[, rn := 1:nrow(x)]

## Split on to/from or to
x <- cSplit(x, "Market.Pair", " to/from | to ", "long", fixed = FALSE, 
            stripWhite = FALSE, type.convert = FALSE)

## Add a column named "type" filled with 'Region' as the value
x[, type := "Region"]

## Using your defined conditions, you can replace the values in the
##   'type' column by reference. Here's 'Air'...
x[nchar(Market.Pair) == 3 | grepl("^.../...$", Market.Pair), type := "Air"]

## ... here's 'Country'
x[nchar(Market.Pair) == 2, type := "Country"]

## ... and here's 'City'
x[grepl("^..-...$", Market.Pair), type := "City"]

## Add an indicator variable...
x[, ind := sequence(.N), by = .(rn, type)]

Now, you can reshape the data to a wide format using dcast from "data.table"

dcast(x, rn ~ type + ind, value.var = "Market.Pair", fill = "")
#    rn   Air_1   Air_2 City_1 City_2 Country_1 Country_2             Region_1   Region_2
# 1:  1                                      US        CA                                
# 2:  2     HOU DFW/DAL                                                                  
# 3:  3 EWR/JFK LAX/SFO                                                                  
# 4:  4                 US-NYC FR-PAR                                                    
# 5:  5                                      US                           Asia           
# 6:  6                                                   Latin America/Mexico EMEA/India
# 7:  7     EWR HKG/NRT                                                                  
# 8:  8     JFK     HKG                                                                  

Upvotes: 4

rawr
rawr

Reputation: 20811

input <- data.frame(Market.Pair = c("US to/from CA", "HOU to/from DFW/DAL",
                                    "EWR/JFK to/from LAX/SFO", "US-NYC to/from FR-PAR",
                                    "US to EMEA/India"))

sp <- strsplit(as.character(input$Market.Pair), '\\s+to(/from)?\\s+')

f <- Vectorize(function(x)
  if (grepl('\\-', x)) 'City' else if (nchar(x) == 2) 'Country' else
    if (grepl('^[A-Z]+/[A-Z]+$|^[A-Z]+$', x)) 'Air' else 'Region')

dd <- lapply(sp, function(x) {
  ## set up output matrix
  cn <- sort(levels(interaction(c('Air','City','Country','Region'), 1:2)))
  m <- matrix('', 1, length(cn), dimnames = list(NULL, cn))
  ## use f above and add the suffix
  xx <- f(x)
  nn <- setNames(x, paste(xx, ave(xx, xx, FUN = seq_along), sep = '.'))
  ## match
  m[, names(nn)] <- nn
  m
})
do.call('rbind.data.frame', dd)

#     Air.1   Air.2 City.1 City.2 Country.1 Country.2   Region.1 Region.2
# 1                                      US        CA                    
# 2     HOU DFW/DAL                                                      
# 3 EWR/JFK LAX/SFO                                                      
# 4                 US-NYC FR-PAR                                        
# 5                                      US           EMEA/India    

Upvotes: 3

Related Questions