Reputation: 33

R remove duplicate data from each column

I get CSV's with hundreds of different columns and would like to be able to output a new file with the duplicate values removed from each column. Everything that I have seen and tried uses a specific column. I just need each column to be unique values.

For Example My Data:

df <- data.frame(A = c(1, 2, 3, 4, 5, 6), B = c(1, 0, 1, 0, 0, 1), C = c("Mr.","Mr.","Mrs.","Miss","Mr.","Mrs."))
df
    A B    C
  1 1 1  Mr.
  2 2 0  Mr.
  3 3 1 Mrs.
  4 4 0 Miss
  5 5 0  Mr.
  6 6 1 Mrs.

I would like:

Then I can:

write.csv(df, file = file.path(df, "df_No_Dupes.csv"), na="")

So I can use it as a reference for my next task.

Upvotes: 1

Answers (3)

PhilC

Reputation: 787

Code snippet to work with a flexible number of columns, remove duplicate columns, and preserve column names:

require(rowr)

df <- data.frame(A = c(1, 2, 3, 4, 5, 6), B = c(1, 0, 1, 0, 0, 1), C = c("Mr.","Mr.","Mrs.","Miss","Mr.","Mrs."))

#get the number of columns in the dataframe
n <- ncol(df)

#loop through the columns
for(i in 1:ncol(df)){

  #replicate column i without duplicates, fill blanks with NAs
  df <-  cbind.fill(df,unique(df[,1]), fill = NA)
  #rename the new column
  colnames(df)[n+1] <- colnames(df)[1]
  #delete the old column
  df[,1] <- NULL
}

Upvotes: 1

Hack-R

Reputation: 23231

df <- data.frame(A = c(1, 2, 3, 4, 5, 6), B = c(1, 0, 1, 0, 0, 1), C = c("Mr.","Mr.","Mrs.","Miss","Mr.","Mrs."))


for(i in 1:ncol(df)){
  assign(paste("df_",i,sep=""), unique(df[,i]))
}

require(rowr)
df <- cbind.fill(df_1,df_2,df_3, fill = NA)

  V1 V1   V1
1  1  1  Mr.
2  2  0 Mrs.
3  3 NA Miss
4  4 NA <NA>
5  5 NA <NA>
6  6 NA <NA>

or you could do

require(rowr)
df <- cbind.fill(df_1,df_2,df_3, fill = "")
df

  V1 V1   V1
1  1  1  Mr.
2  2  0 Mrs.
3  3    Miss
4  4        
5  5        
6  6

If you want to avoid typing the name of each intermediate dataframe you can just use ls(pattern="df_") and get the objects named in that vector or use another loop.

If you want to change the column names back to their original values you can use:

colnames(output_df) <- colnames(input_df)

Then you can save the results however you, like, i.e.

saveRDS()

save()

or write it to a file.

Putting it all together:

df <- data.frame(A = c(1, 2, 3, 4, 5, 6), B = c(1, 0, 1, 0, 0, 1), C = c("Mr.","Mr.","Mrs.","Miss","Mr.","Mrs."))


for(i in 1:ncol(df)){
  assign(paste("df_",i,sep=""), unique(df[,i]))
}

require(rowr)
files     <- ls(pattern="df_")

df_output <- data.frame()
for(i in files){
  df_output <- cbind.fill(df_output, get(i), fill = "")
}

df_output <- df_output[,2:4] # fix extra colname from initialization
colnames(df_output) <- colnames(df)
write.csv(df_output, "df_out.csv",row.names = F)

verify_it_worked <- read.csv("df_out.csv")
verify_it_worked

  A  B    C
1 1  1  Mr.
2 2  0 Mrs.
3 3    Miss
4 4      
5 5      
6 6

Upvotes: 0

MrFlick

Reputation: 206546

read.csv and write.csv work best with tabular data. Your desired output is not a good example of this (every row does not have the same number of columns).

You can easily get all the unique value for your columns with

vals <- sapply(df, unique)

Then you'd be better off saving this object with save() and load() to preserve the list as an R object.

Upvotes: 1

R remove duplicate data from each column

Answers (3)

Related Questions