melamala
melamala

Reputation: 23

Compare columns across multiple data frames in R

I am analyzing data stored in separate monthly files across 5 years. The individual datasets are expected to have different numbers and mixes of rows (cases) per dataset, but should have the same variables and variable names. However, I know from looking at them that some do not, and that I will have to manipulate these before merging all datasets into one data frame. Is there any way of automating the process of identifying these, in particular to document the changes I make?

For example, say I have the following 4 datasets:

df1 <- data.frame(id = c(1, 2, 3), age = c(10, 20, 30), color = 
       c("green", "purple", "gold"))
df2 <- data.frame(id = c(1, 2, 3, 4), age = c(10, 20, 30, 40), color = 
       c("green", "purple", "gold", "blue"))
df3 <- data.frame(id = c(1, 3, 4), age = c(10, 30, 40), colour = 
       c("green", "gold", "blue"), kids = c(0, 1, 1))
df4 <- data.frame(id = c(1, 2, 4), color = c("green", "purple", 
       "blue"), age = c(10, 20, 40))

All 4 datasets should look like df1 and df2, but df3 has a different spelling for one variable ("colour") and extra variable ("kids"), while df4 reverses the order of two variables ("age" and "color").

I can visually inspect column numbers and names to identify problems, like so:

df.list <- list(df1,df2,df3,df4)
df.cols <- lapply(df.list, function(x) ncol(x))
df.names <- lapply(df.list, function(x) names(x))

Output

> df.cols
[[1]]
[1] 3

[[2]]
[1] 3

[[3]]
[1] 4

[[4]]
[1] 3

> df.names
[[1]]
[1] "id"    "age"   "color"

[[2]]
[1] "id"    "age"   "color"

[[3]]
[1] "id"     "age"    "colour" "kids"  

[[4]]
[1] "id"    "color" "age" 

But I have many more variables and and datasets. Is there a more efficient way to do this, to automatically and reproducibly identify mismatches? I have only found ways to compare across pairs of datasets. Thanks for any help or suggestions!

Upvotes: 2

Views: 747

Answers (1)

denis
denis

Reputation: 5673

An easy way can be

dflist <- mget(paste0("df", 1:4))

nametest <- sapply(dflist,function(x){names(x) %in% names(df1)})
$df1
[1] TRUE TRUE TRUE

$df2
[1] TRUE TRUE TRUE

$df3
[1]  TRUE  TRUE FALSE FALSE

$df4
[1] TRUE TRUE TRUE

And then to see only the problematic ones

quickview <- unlist(sapply(nametest,function(x){if(!all(x))"name pb"}))
      df3 
"name pb" 

This way you verify names and column at the same time (additional column with other name will give a FALSE in nametest). If you want the index of the wrong columns:

index <- sapply(dflist,function(x){
  which(!names(x) %in% names(df1))})

$df1
integer(0)

$df2
integer(0)

$df3
[1] 3 4

$df4
integer(0)

And if you want the names of the columns:

wrongnames <- sapply(dflist,function(x){
  indx <- which(!names(x) %in% names(df1))
  names(x)[indx]
  })

Upvotes: 1

Related Questions