Reputation: 23
I am analyzing data stored in separate monthly files across 5 years. The individual datasets are expected to have different numbers and mixes of rows (cases) per dataset, but should have the same variables and variable names. However, I know from looking at them that some do not, and that I will have to manipulate these before merging all datasets into one data frame. Is there any way of automating the process of identifying these, in particular to document the changes I make?
For example, say I have the following 4 datasets:
df1 <- data.frame(id = c(1, 2, 3), age = c(10, 20, 30), color =
c("green", "purple", "gold"))
df2 <- data.frame(id = c(1, 2, 3, 4), age = c(10, 20, 30, 40), color =
c("green", "purple", "gold", "blue"))
df3 <- data.frame(id = c(1, 3, 4), age = c(10, 30, 40), colour =
c("green", "gold", "blue"), kids = c(0, 1, 1))
df4 <- data.frame(id = c(1, 2, 4), color = c("green", "purple",
"blue"), age = c(10, 20, 40))
All 4 datasets should look like df1 and df2, but df3 has a different spelling for one variable ("colour") and extra variable ("kids"), while df4 reverses the order of two variables ("age" and "color").
I can visually inspect column numbers and names to identify problems, like so:
df.list <- list(df1,df2,df3,df4)
df.cols <- lapply(df.list, function(x) ncol(x))
df.names <- lapply(df.list, function(x) names(x))
Output
> df.cols
[[1]]
[1] 3
[[2]]
[1] 3
[[3]]
[1] 4
[[4]]
[1] 3
> df.names
[[1]]
[1] "id" "age" "color"
[[2]]
[1] "id" "age" "color"
[[3]]
[1] "id" "age" "colour" "kids"
[[4]]
[1] "id" "color" "age"
But I have many more variables and and datasets. Is there a more efficient way to do this, to automatically and reproducibly identify mismatches? I have only found ways to compare across pairs of datasets. Thanks for any help or suggestions!
Upvotes: 2
Views: 747
Reputation: 5673
An easy way can be
dflist <- mget(paste0("df", 1:4))
nametest <- sapply(dflist,function(x){names(x) %in% names(df1)})
$df1
[1] TRUE TRUE TRUE
$df2
[1] TRUE TRUE TRUE
$df3
[1] TRUE TRUE FALSE FALSE
$df4
[1] TRUE TRUE TRUE
And then to see only the problematic ones
quickview <- unlist(sapply(nametest,function(x){if(!all(x))"name pb"}))
df3
"name pb"
This way you verify names and column at the same time (additional column with other name will give a FALSE in nametest). If you want the index of the wrong columns:
index <- sapply(dflist,function(x){
which(!names(x) %in% names(df1))})
$df1
integer(0)
$df2
integer(0)
$df3
[1] 3 4
$df4
integer(0)
And if you want the names of the columns:
wrongnames <- sapply(dflist,function(x){
indx <- which(!names(x) %in% names(df1))
names(x)[indx]
})
Upvotes: 1