Reputation: 16090
I have a data frame like
ddf <- data.frame(
X = c("Fruit.Apple", "Fruit.Pear",
"Car.Mazda", "Car.Toyota",
"North.American.City.Chicago", "North.American.City.Ottawa", "North American.City.Toronto", "North.American.City.Los.Angeles", "Unique.Snowflake"),
Y = runif(9) # doesn't matter
)
X Y
1 Fruit.Apple 0.2655087
2 Fruit.Pear 0.3721239
3 Car.Mazda 0.5728534
4 Car.Toyota 0.9082078
5 North.American.City.Chicago 0.2016819
6 North.American.City.Ottawa 0.8983897
7 North.American.City.Toronto 0.9446753
8 North.American.City.Los.Angeles 0.6607978
9 Unique.Snowflake 0.6291140
and I want:
X.1 X.2 Y
1 Fruit Apple 0.2655087
2 Fruit Pear 0.3721239
3 Car Mazda 0.5728534
4 Car Toyota 0.9082078
5 North.American.City Chicago 0.2016819
6 North.American.City Ottawa 0.8983897
7 North.American.City Toronto 0.9446753
8 North.American.City Los.Angeles 0.6607978
9 <NA> Unique.Snowflake 0.6291140
I'm not totally convinced my problem is soluble, but it seems to have a pattern. I'm completing stumped for a solution. It would be easy if prefixes were easily separated but as the North.American.City
example shows, sometimes the prefix contains the separating character. It would be relatively straightforward if the suffix didn't contain the .
but Los.Angeles
shouldn't be separated. I also only want true prefixes to appear in X.1
as I've demonstrated with Unique.Snowflake
. My only thought is to create new column of all text between .
using gsub("(.*)\\..*$", "\\1", ...)
and some nested for-loops to work out which ones are prefixes, but there must be a better way.
Upvotes: 1
Views: 84
Reputation: 206197
OK. This is a lot of messy code but it does the job. I'm sure others can come up with more elegant solutions.
#sample vector for splitting
X = c("Fruit.Apple", "Fruit.Pear",
"Car.Mazda", "Car.Toyota",
"North.American.City.Chicago", "North.American.City.Ottawa",
"North.American.City.Toronto", "North.American.City.Los.Angeles",
"Unique.Snowflake"
)
#split on "." and prepare candidates
parts<-strsplit(X,".", fixed=T)
scores<-lapply(parts, function(p) {
lp<-length(p)
list(
c("",sapply(seq.int(p), function(x) paste(p[1:x], collapse="."))),
c(sapply(seq.int(p), function(x) paste(p[x:lp], collapse=".")),""),
seq.int(lp+1)
)
});
#now combine considerations
options<-do.call(rbind, lapply(seq.int(scores), function(i)
data.frame(
item=i,
prefix=scores[[i]][[1]],
suffix=scores[[i]][[2]],
depth=scores[[i]][[3]]))
)
#now add the freq score across all categories
options$freq=ave(rep.int(1,nrow(options)),options$prefix, FUN=length)
#finally, select the longest prefix combination that occurs >1 times
best<-do.call(rbind, by(options, options$item, function(x) {
x[order(x$freq<=1, -x$depth), ][1,]
}))
best[,2:3];
And this results in
prefix suffix
1 Fruit Apple
2 Fruit Pear
3 Car Mazda
4 Car Toyota
5 North.American.City Chicago
6 North.American.City Ottawa
7 North.American.City Toronto
8 North.American.City Los.Angeles
9 Unique.Snowflake
Upvotes: 2
Reputation: 263332
ddf2 <- gsub("\\.([[:alpha:]]*)$", " \\1", ddf$X)
data.frame( read.table(text=ddf2), Y=ddf$Y)
#------------
V1 V2 Y
1 Fruit Apple 0.8097010
2 Fruit Pear 0.6934737
3 Car Mazda 0.2143207
4 Car Toyota 0.5036963
5 North.American.City Chicago 0.7364826
6 North.American.City Ottawa 0.8603377
7 North.American.City Toronto 0.4751705
(Your example construction code doesn't make the line with "snowflake. Nor do you describe why we would not split at the last period..)
Upvotes: 0