Hugh
Hugh

Reputation: 16090

Separating character columns with common prefixes

I have a data frame like

 ddf <- data.frame(
  X = c("Fruit.Apple", "Fruit.Pear", 
        "Car.Mazda", "Car.Toyota", 
        "North.American.City.Chicago", "North.American.City.Ottawa", "North American.City.Toronto", "North.American.City.Los.Angeles", "Unique.Snowflake"), 
  Y = runif(9)  # doesn't matter
  )

                            X         Y
1                     Fruit.Apple 0.2655087
2                      Fruit.Pear 0.3721239
3                       Car.Mazda 0.5728534
4                      Car.Toyota 0.9082078
5     North.American.City.Chicago 0.2016819
6      North.American.City.Ottawa 0.8983897
7     North.American.City.Toronto 0.9446753
8 North.American.City.Los.Angeles 0.6607978
9                Unique.Snowflake 0.6291140

and I want:

                  X.1              X.2         Y
1               Fruit            Apple 0.2655087
2               Fruit             Pear 0.3721239
3                 Car            Mazda 0.5728534
4                 Car           Toyota 0.9082078
5 North.American.City          Chicago 0.2016819
6 North.American.City           Ottawa 0.8983897
7 North.American.City          Toronto 0.9446753
8 North.American.City      Los.Angeles 0.6607978
9                <NA> Unique.Snowflake 0.6291140

I'm not totally convinced my problem is soluble, but it seems to have a pattern. I'm completing stumped for a solution. It would be easy if prefixes were easily separated but as the North.American.City example shows, sometimes the prefix contains the separating character. It would be relatively straightforward if the suffix didn't contain the . but Los.Angeles shouldn't be separated. I also only want true prefixes to appear in X.1 as I've demonstrated with Unique.Snowflake. My only thought is to create new column of all text between . using gsub("(.*)\\..*$", "\\1", ...) and some nested for-loops to work out which ones are prefixes, but there must be a better way.

Upvotes: 1

Views: 84

Answers (2)

MrFlick
MrFlick

Reputation: 206197

OK. This is a lot of messy code but it does the job. I'm sure others can come up with more elegant solutions.

#sample vector for splitting
X = c("Fruit.Apple", "Fruit.Pear", 
    "Car.Mazda", "Car.Toyota", 
    "North.American.City.Chicago", "North.American.City.Ottawa", 
    "North.American.City.Toronto", "North.American.City.Los.Angeles",
    "Unique.Snowflake"
)

#split on "." and prepare candidates
parts<-strsplit(X,".", fixed=T)
scores<-lapply(parts, function(p) {
    lp<-length(p)
    list(
        c("",sapply(seq.int(p), function(x) paste(p[1:x], collapse="."))),
        c(sapply(seq.int(p), function(x) paste(p[x:lp], collapse=".")),""),
        seq.int(lp+1)
    )
});

#now combine considerations
options<-do.call(rbind, lapply(seq.int(scores), function(i) 
    data.frame(
        item=i, 
        prefix=scores[[i]][[1]], 
        suffix=scores[[i]][[2]],
        depth=scores[[i]][[3]]))
    )
#now add the freq score across all categories
options$freq=ave(rep.int(1,nrow(options)),options$prefix, FUN=length)

#finally, select the longest prefix combination that occurs >1 times
best<-do.call(rbind, by(options, options$item, function(x) {
    x[order(x$freq<=1, -x$depth), ][1,]
}))
best[,2:3];

And this results in

               prefix           suffix
1               Fruit            Apple
2               Fruit             Pear
3                 Car            Mazda
4                 Car           Toyota
5 North.American.City          Chicago
6 North.American.City           Ottawa
7 North.American.City          Toronto
8 North.American.City      Los.Angeles
9                     Unique.Snowflake

Upvotes: 2

IRTFM
IRTFM

Reputation: 263332

 ddf2 <- gsub("\\.([[:alpha:]]*)$", " \\1", ddf$X)
 data.frame( read.table(text=ddf2), Y=ddf$Y)
#------------
                   V1      V2         Y
1               Fruit   Apple 0.8097010
2               Fruit    Pear 0.6934737
3                 Car   Mazda 0.2143207
4                 Car  Toyota 0.5036963
5 North.American.City Chicago 0.7364826
6 North.American.City  Ottawa 0.8603377
7 North.American.City Toronto 0.4751705

(Your example construction code doesn't make the line with "snowflake. Nor do you describe why we would not split at the last period..)

Upvotes: 0

Related Questions