Michael Thompson
Michael Thompson

Reputation: 13

Compare the column names of multiple CSV files in R

I'm combining 12 CSV files into one dataframe in R. Before doing this I want to ensure all the column names are an exact match with each other. I've made a dataframe where each column is the column names of the 12 CSV files.

jul21_cols <- data.frame(colnames(jul21))
aug21_cols <- data.frame(colnames(aug21))
sep21_cols <- data.frame(colnames(sep21))
oct21_cols <- data.frame(colnames(oct21))
nov21_cols <- data.frame(colnames(nov21))
dec21_cols <- data.frame(colnames(dec21))
jan22_cols <- data.frame(colnames(jan22))
feb22_cols <- data.frame(colnames(feb22))
mar22_cols <- data.frame(colnames(mar22))
apr22_cols <- data.frame(colnames(apr22))
may22_cols <- data.frame(colnames(may22))
jun22_cols <- data.frame(colnames(jun22))

col_df <- cbind(jul21_cols,aug21_cols,sep21_cols,oct21_cols,nov21_cols,dec21_cols,
                 jan22_cols,feb22_cols,mar22_cols,apr22_cols,may22_cols,jun22_cols)

I've tried using the identical function to compare 2 columns at a time.

identical(col_df[['jul21']], col_df[['aug21']])
identical(col_df[['aug21']], col_df[['sep21']])
identical(col_df[['sep21']], col_df[['oct21']])
identical(col_df[['oct21']], col_df[['nov21']])
identical(col_df[['nov21']], col_df[['dec21']])
identical(col_df[['dec21']], col_df[['jan22']])
identical(col_df[['jan22']], col_df[['feb22']])
identical(col_df[['feb22']], col_df[['mar22']])
identical(col_df[['mar22']], col_df[['apr22']])
identical(col_df[['apr22']], col_df[['may22']])
identical(col_df[['may22']], col_df[['jun22']])`

All of the identical lines return the value of TRUE

I'm just trying to verify that this code is telling me all my column names are identical in each CSV files before I move on. I'd also like to know if there is a more efficient way to solve this problem.

Upvotes: 1

Views: 403

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 269654

We assume that what is needed is to determine if the column names are the same and in same order and if not to determine which differ.

First get a character vector, Names, containing the names of the data frames and from that make a named list L containing the data frames themselves.

From those names assemble a list L of the data frames and then get a character vector nms whose elements are strings of column names, one for each data frame.

Finally group the names of the data frames using tapply and nms as the groupings so we can see which data frames contain which columns. In the example below aug21 and jul21 have one set of columns, i.e. Time and demand, and sep21 has a different set, i.e. Time and DEMAND. If there were only one row then all data frames have the same column names in the same order.

Names <- c("jul21", "aug21", "sep21") # using example in Note

L <- mget(Names)[Names]

nms <- sapply(names(L), function(x) toString(names(L[[x]])))
tab <- stack(tapply(names(nms), nms, toString))
names(tab) <- c("data.frames", "column.names")

nrow(tab)
## [1] 2

tab
##    data.frames column.names
## 1 jul21, aug21 Time, demand
## 2        sep21 Time, DEMAND

graph

Another approach which could be used alternately or in conjuction with the one above is to create a graph such that each vertex is a data frame and each edge means that the two vertices on either end of the edge have the same column names in the same order. Each connected component represents distinct column names or orders. From the example below we see that jul21 and aug21 form one connected component and sep21 forms a second connected component.

To investigate how data frame column names differ note that setdiff(names(jul21), names(sep21)) will show names that are in jul21 but not in sep21 and the reverse can be used for the other direction. If the setdiff in both directions are zero length vectors and names vectors are not the same then they differ by order.

library(igraph)
set.seed(123)

isSame <- function(x, y) +identical(names(x), names(y))
A <- outer(L, L, Vectorize(isSame))
diag(A) <- 0

g <- graph_from_adjacency_matrix(A, "undirected")
plot(g, vertex.color = "white", vertex.size = 30)

screenshot

Note

Test data. BOD comes with R.

jul21 <- aug21 <- sep21 <- BOD
names(sep21) <- c("Time", "DEMAND")

Upvotes: 0

zephryl
zephryl

Reputation: 17134

First, identical() will only return TRUE if the two dataframes have all the same column names in the same order. If you don’t care about order, just that all the same names are in both dataframes, you can sort() the names before comparing as shown below.

Second, you can often use the base::lapply() or purrr::map() families of functions for operations requiring iteration.

For your case, let’s put your dataframes in a list (which they probably should be to begin with), then use sapply() to compare the column names of the first df in the list to the column names of all other dfs.

jul21 <- data.frame(x = 1, y = 2)
aug21 <- data.frame(x = 3, y = 4) 
sep21 <- data.frame(y = 6, x = 5)

dfs <- list(jul21,aug21,sep21)

all(sapply(
  dfs[-1], 
  \(x) identical(sort(colnames(x)), sort(colnames(dfs[[1]])))
))
# TRUE

And as another test case, we’ll add a df with a non-matching column.

oct22 <- data.frame(x = 1, y = 2, z = 3)
dfs[[4]] <- oct22

all(sapply(
  dfs[-1], 
  \(x) identical(sort(colnames(x)), sort(colnames(dfs[[1]])))
))
# FALSE

Upvotes: 1

Related Questions