user9292

Reputation: 1145

Remove duplicate rows for multiple dataframes

I have over 100 dataframes (df1, df2, df3, ...) that all contain the same variables. I want to loop through all of them and remove duplicate rows by id. For df1, I can do:

df1 <- df1[!duplicated(df1$id), ]

How can I do this in an efficient way?

Upvotes: 0

Views: 997

Answers (1)

r2evans

Reputation: 160617

If you're dealing with 100 similarly-structured data.frames, I suggest instead of naming them uniquely, you put them in a list.

Assuming they are all named df and a number, then you can easily assign them to a list with something like:

df_varnames <- ls()[ grep("^df[0-9]+$", ls()) ]

or, as @MatteoCastagna suggested in a comment:

df_varnames <- ls(pattern = "^df[0-9]+$")

(which is both faster and cleaner). Then:

dflist <- sapply(df_varnames, get, simplify = FALSE)

And from here, your question is simply:

dflist2 <- lapply(dflist, function(z) z[!duplicated(z$id), ])
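Putting the pieces together, here is a minimal end-to-end sketch with two toy data.frames (the `id` values and columns are made up for illustration):

```r
# Two small data.frames with duplicated ids
df1 <- data.frame(id = c(1, 1, 2), x = c("a", "b", "c"))
df2 <- data.frame(id = c(3, 3, 3), x = c("d", "e", "f"))

# Collect every variable named df<number> into a named list
df_varnames <- ls(pattern = "^df[0-9]+$")
dflist <- sapply(df_varnames, get, simplify = FALSE)

# Keep only the first row for each id in every data.frame
dflist2 <- lapply(dflist, function(z) z[!duplicated(z$id), ])

sapply(dflist2, nrow)
# df1 df2
#   2   1
```

From here on, every operation you would have looped over 100 variable names becomes a single `lapply` over `dflist2`.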

If you must deal with them as individual data.frames (again, discouraged, almost always slows down processing while not adding any functionality), you can try a hack like this (using df_varnames from above):

for (dfname in df_varnames) {
  df <- get(dfname)
  assign(dfname, df[! duplicated(df$id), ])
}
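If you later need the cleaned results back as individual variables (again, only if your workflow truly requires it), base R's `list2env` reverses the `get` step. A minimal sketch, assuming a deduplicated list named `dflist2`:

```r
# A deduplicated list, as produced by the lapply step above
dflist2 <- list(
  df1 = data.frame(id = c(1, 2), x = c("a", "c")),
  df2 = data.frame(id = 3, x = "d")
)

# Write each list element back as a variable named after its list name
list2env(dflist2, envir = globalenv())

nrow(df1)
# [1] 2
```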

I cringe when I consider using this, but I admit I may not understand your workflow.

Upvotes: 4
