Reputation: 33
I get CSV's with hundreds of different columns and would like to be able to output a new file with the duplicate values removed from each column. Everything that I have seen and tried uses a specific column. I just need each column to be unique values.
For Example My Data:
df <- data.frame(A = c(1, 2, 3, 4, 5, 6), B = c(1, 0, 1, 0, 0, 1), C = c("Mr.","Mr.","Mrs.","Miss","Mr.","Mrs."))
df
A B C
1 1 1 Mr.
2 2 0 Mr.
3 3 1 Mrs.
4 4 0 Miss
5 5 0 Mr.
6 6 1 Mrs.
I would like:
A B C
1 1 1 Mr.
2 2 0 Mrs.
3 3 Miss
4 4
5 5
6 6
Then I can:
write.csv(df, file = file.path(df, "df_No_Dupes.csv"), na="")
So I can use it as a reference for my next task.
Upvotes: 1
Views: 1325
Reputation: 787
Code snippet to work with a flexible number of columns, remove duplicate columns, and preserve column names:
require(rowr)
df <- data.frame(A = c(1, 2, 3, 4, 5, 6), B = c(1, 0, 1, 0, 0, 1), C = c("Mr.","Mr.","Mrs.","Miss","Mr.","Mrs."))
#get the number of columns in the dataframe
n <- ncol(df)
#loop through the columns
for(i in 1:ncol(df)){
#replicate column i without duplicates, fill blanks with NAs
df <- cbind.fill(df,unique(df[,1]), fill = NA)
#rename the new column
colnames(df)[n+1] <- colnames(df)[1]
#delete the old column
df[,1] <- NULL
}
Upvotes: 1
Reputation: 23231
df <- data.frame(A = c(1, 2, 3, 4, 5, 6), B = c(1, 0, 1, 0, 0, 1), C = c("Mr.","Mr.","Mrs.","Miss","Mr.","Mrs."))
for(i in 1:ncol(df)){
assign(paste("df_",i,sep=""), unique(df[,i]))
}
require(rowr)
df <- cbind.fill(df_1,df_2,df_3, fill = NA)
V1 V1 V1 1 1 1 Mr. 2 2 0 Mrs. 3 3 NA Miss 4 4 NA <NA> 5 5 NA <NA> 6 6 NA <NA>
or you could do
require(rowr)
df <- cbind.fill(df_1,df_2,df_3, fill = "")
df
V1 V1 V1 1 1 1 Mr. 2 2 0 Mrs. 3 3 Miss 4 4 5 5 6 6
If you want to avoid typing the name of each intermediate dataframe you can just use ls(pattern="df_")
and get
the objects named in that vector or use another loop.
If you want to change the column names back to their original values you can use:
colnames(output_df) <- colnames(input_df)
Then you can save the results however you, like, i.e.
saveRDS()
save()
or write it to a file.
Putting it all together:
df <- data.frame(A = c(1, 2, 3, 4, 5, 6), B = c(1, 0, 1, 0, 0, 1), C = c("Mr.","Mr.","Mrs.","Miss","Mr.","Mrs."))
for(i in 1:ncol(df)){
assign(paste("df_",i,sep=""), unique(df[,i]))
}
require(rowr)
files <- ls(pattern="df_")
df_output <- data.frame()
for(i in files){
df_output <- cbind.fill(df_output, get(i), fill = "")
}
df_output <- df_output[,2:4] # fix extra colname from initialization
colnames(df_output) <- colnames(df)
write.csv(df_output, "df_out.csv",row.names = F)
verify_it_worked <- read.csv("df_out.csv")
verify_it_worked
A B C 1 1 1 Mr. 2 2 0 Mrs. 3 3 Miss 4 4 5 5 6 6
Upvotes: 0
Reputation: 206546
read.csv
and write.csv
work best with tabular data. Your desired output is not a good example of this (every row does not have the same number of columns).
You can easily get all the unique value for your columns with
vals <- sapply(df, unique)
Then you'd be better off saving this object with save()
and load()
to preserve the list as an R object.
Upvotes: 1