Separating character columns with common prefixes

Question

I have a data frame like

 ddf <- data.frame(
  X = c("Fruit.Apple", "Fruit.Pear", 
        "Car.Mazda", "Car.Toyota", 
        "North.American.City.Chicago", "North.American.City.Ottawa", "North American.City.Toronto", "North.American.City.Los.Angeles", "Unique.Snowflake"), 
  Y = runif(9)  # doesn't matter
  )

                            X         Y
1                     Fruit.Apple 0.2655087
2                      Fruit.Pear 0.3721239
3                       Car.Mazda 0.5728534
4                      Car.Toyota 0.9082078
5     North.American.City.Chicago 0.2016819
6      North.American.City.Ottawa 0.8983897
7     North.American.City.Toronto 0.9446753
8 North.American.City.Los.Angeles 0.6607978
9                Unique.Snowflake 0.6291140

and I want:

                  X.1              X.2         Y
1               Fruit            Apple 0.2655087
2               Fruit             Pear 0.3721239
3                 Car            Mazda 0.5728534
4                 Car           Toyota 0.9082078
5 North.American.City          Chicago 0.2016819
6 North.American.City           Ottawa 0.8983897
7 North.American.City          Toronto 0.9446753
8 North.American.City      Los.Angeles 0.6607978
9                 Unique.Snowflake 0.6291140

I'm not totally convinced my problem is soluble, but it seems to have a pattern. I'm completing stumped for a solution. It would be easy if prefixes were easily separated but as the North.American.City example shows, sometimes the prefix contains the separating character. It would be relatively straightforward if the suffix didn't contain the . but Los.Angeles shouldn't be separated. I also only want true prefixes to appear in X.1 as I've demonstrated with Unique.Snowflake. My only thought is to create new column of all text between . using gsub("(.*)\..*$", "\1", ...) and some nested for-loops to work out which ones are prefixes, but there must be a better way.

MrFlick · Accepted Answer

OK. This is a lot of messy code but it does the job. I'm sure others can come up with more elegant solutions.

#sample vector for splitting
X = c("Fruit.Apple", "Fruit.Pear", 
    "Car.Mazda", "Car.Toyota", 
    "North.American.City.Chicago", "North.American.City.Ottawa", 
    "North.American.City.Toronto", "North.American.City.Los.Angeles",
    "Unique.Snowflake"
)

#split on "." and prepare candidates
parts<-strsplit(X,".", fixed=T)
scores<-lapply(parts, function(p) {
    lp<-length(p)
    list(
        c("",sapply(seq.int(p), function(x) paste(p[1:x], collapse="."))),
        c(sapply(seq.int(p), function(x) paste(p[x:lp], collapse=".")),""),
        seq.int(lp+1)
    )
});

#now combine considerations
options<-do.call(rbind, lapply(seq.int(scores), function(i) 
    data.frame(
        item=i, 
        prefix=scores[[i]][[1]], 
        suffix=scores[[i]][[2]],
        depth=scores[[i]][[3]]))
    )
#now add the freq score across all categories
options$freq=ave(rep.int(1,nrow(options)),options$prefix, FUN=length)

#finally, select the longest prefix combination that occurs >1 times
best<-do.call(rbind, by(options, options$item, function(x) {
    x[order(x$freq<=1, -x$depth), ][1,]
}))
best[,2:3];

And this results in

               prefix           suffix
1               Fruit            Apple
2               Fruit             Pear
3                 Car            Mazda
4                 Car           Toyota
5 North.American.City          Chicago
6 North.American.City           Ottawa
7 North.American.City          Toronto
8 North.American.City      Los.Angeles
9                     Unique.Snowflake

Separating character columns with common prefixes

Answers (2)

Related Questions