Reputation: 945
I have a data frame where some columns have the same data, but different column names. I would like to remove duplicated columns, but merge the column names. An example, where test1 and test4 columns are duplicates:
df
test1 test2 test3 test4
1 1 1 0 1
2 2 2 2 2
3 3 4 4 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
and I would like the result to be something like this:
df
test1+test4 test2 test3
1 1 1 0
2 2 2 2
3 3 4 4
4 4 4 4
5 5 5 5
6 6 6 6
Here is the data:
structure(list(test1 = c(1, 2, 3, 4, 5, 6), test2 = c(1, 2, 4,
4, 5, 6), test3 = c(0, 2, 4, 4, 5, 6), test4 = c(1, 2, 3, 4,
5, 6)), .Names = c("test1", "test2", "test3", "test4"), row.names = c(NA,
-6L), class = "data.frame")
Please note that I do not simply want to remove duplicated columns. I also want to merge the column names of the duplicated columns, after the duplicates are removed.
I could do it manually for the simple table I posted, but I want to use this on large datasets, where I don't know in advance what columns are identical. I do not what to remove and rename columns manually, since I might have over 50 duplicated columns.
Upvotes: 0
Views: 3032
Reputation: 520
Ok, improving on the above answer using the idea from here. Save the duplicate and non-duplicate columns into data frames. Check to see if the non-duplicates match any duplicates, and if so concatenate their columns names. So this will now work if you have more than two duplicate columns.
Editted: Changed summary
to digest
. This helps with character data.
df <- structure(list(test1 = c(1, 2, 3, 4, 5, 6), test2 = c(1, 2, 4,
4, 5, 6), test3 = c(0, 2, 4, 4, 5, 6), test4 = c(1, 2, 3, 4,
5, 6)), .Names = c("test1", "test2", "test3", "test4"), row.names = c(NA,
-6L), class = "data.frame")
library(digest)
nondups <- df[!duplicated(lapply(df, digest))]
dups <- df[duplicated(lapply(df, digest))]
for(i in 1:ncol(nondups)){
for(j in 1:ncol(dups)){
if(FALSE %in% paste0(nondups[,i] == dups[,j])) NULL
else names(nondups)[i] <- paste(names(nondups[i]), names(dups[j]), sep = "+")
}
}
nondups
Example 2, as a function.
Editted: Changed summary
to digest
and return non-duplicated and duplicated data frames.
age <- 18:29
height <- c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender <- c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe <- data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender, gender3 = gender)
dupcols <- function(df = testframe){
nondups <- df[!duplicated(lapply(df, digest))]
dups <- df[duplicated(lapply(df, digest))]
for(i in 1:ncol(nondups)){
for(j in 1:ncol(dups)){
if(FALSE %in% paste0(nondups[,i] == dups[,j])) NULL
else names(nondups)[i] <- paste(names(nondups[i]), names(dups[j]), sep = "+")
}
}
return(list(df1 = nondups, df2 = dups))
}
dupcols(df = testframe)
Editted: This section is new.
Example 3: On a large data frame
#Creating a 1500 column by 15000 row data frame
dat <- do.call(data.frame, replicate(1500, rep(FALSE, 15000), simplify=FALSE))
names(dat) <- 1:1500
#Fill the data frame with LETTERS across the rows
#This part may take a while. Took my PC about 23 minutes.
start <- Sys.time()
fill <- rep(LETTERS, times = ceiling((15000*1500)/26))
j <- 0
for(i in 1:nrow(dat)){
dat[i,] <- fill[(1+j):(1500+j)]
j <- j + 1500
}
difftime(Sys.time(), start, "mins")
#Run the function on the created data set
#This took about 4 minutes to complete on my PC.
start <- Sys.time()
result <- dupcols(df = dat)
difftime(Sys.time(), start, "mins")
names(result$df1)
ncol(result$df1)
ncol(result$df2)
Upvotes: 1
Reputation: 520
It's not completely automated, but the output of the loop will identify pairs of duplicate columns. You'll then have to remove one of the duplicate columns and then re-name based on what columns were duplicates.
df <- structure(list(test1 = c(1, 2, 3, 4, 5, 6), test2 = c(1, 2, 4,
4, 5, 6), test3 = c(0, 2, 4, 4, 5, 6), test4 = c(1, 2, 3, 4,
5, 6)), .Names = c("test1", "test2", "test3", "test4"), row.names = c(NA,
-6L), class = "data.frame")
for(i in 1:(ncol(df)-1)){
for(j in 2:ncol(df)){
if(i == j) NULL
else if(FALSE %in% paste0(df[,i] == df[,j])) NULL
else print(paste(i, j, sep = " + "))
}
}
new <- df[,-4]
names(new)[1] <- paste(names(df[1]), names(df[4]), sep = "+")
new
Upvotes: 0