Compare the column names of multiple CSV files in R

Question

I'm combining 12 CSV files into one dataframe in R. Before doing this I want to ensure all the column names are an exact match with each other. I've made a dataframe where each column is the column names of the 12 CSV files.

jul21_cols <- data.frame(colnames(jul21))
aug21_cols <- data.frame(colnames(aug21))
sep21_cols <- data.frame(colnames(sep21))
oct21_cols <- data.frame(colnames(oct21))
nov21_cols <- data.frame(colnames(nov21))
dec21_cols <- data.frame(colnames(dec21))
jan22_cols <- data.frame(colnames(jan22))
feb22_cols <- data.frame(colnames(feb22))
mar22_cols <- data.frame(colnames(mar22))
apr22_cols <- data.frame(colnames(apr22))
may22_cols <- data.frame(colnames(may22))
jun22_cols <- data.frame(colnames(jun22))

col_df <- cbind(jul21_cols,aug21_cols,sep21_cols,oct21_cols,nov21_cols,dec21_cols,
                 jan22_cols,feb22_cols,mar22_cols,apr22_cols,may22_cols,jun22_cols)

I've tried using the identical function to compare 2 columns at a time.

identical(col_df[['jul21']], col_df[['aug21']])
identical(col_df[['aug21']], col_df[['sep21']])
identical(col_df[['sep21']], col_df[['oct21']])
identical(col_df[['oct21']], col_df[['nov21']])
identical(col_df[['nov21']], col_df[['dec21']])
identical(col_df[['dec21']], col_df[['jan22']])
identical(col_df[['jan22']], col_df[['feb22']])
identical(col_df[['feb22']], col_df[['mar22']])
identical(col_df[['mar22']], col_df[['apr22']])
identical(col_df[['apr22']], col_df[['may22']])
identical(col_df[['may22']], col_df[['jun22']])`

All of the identical lines return the value of TRUE

I'm just trying to verify that this code is telling me all my column names are identical in each CSV files before I move on. I'd also like to know if there is a more efficient way to solve this problem.

zephryl · Accepted Answer

First, identical() will only return TRUE if the two dataframes have all the same column names in the same order. If you don’t care about order, just that all the same names are in both dataframes, you can sort() the names before comparing as shown below.

Second, you can often use the base::lapply() or purrr::map() families of functions for operations requiring iteration.

For your case, let’s put your dataframes in a list (which they probably should be to begin with), then use sapply() to compare the column names of the first df in the list to the column names of all other dfs.

jul21 <- data.frame(x = 1, y = 2)
aug21 <- data.frame(x = 3, y = 4) 
sep21 <- data.frame(y = 6, x = 5)

dfs <- list(jul21,aug21,sep21)

all(sapply(
  dfs[-1], 
  \(x) identical(sort(colnames(x)), sort(colnames(dfs[[1]])))
))
# TRUE

And as another test case, we’ll add a df with a non-matching column.

oct22 <- data.frame(x = 1, y = 2, z = 3)
dfs[[4]] <- oct22

all(sapply(
  dfs[-1], 
  \(x) identical(sort(colnames(x)), sort(colnames(dfs[[1]])))
))
# FALSE

Compare the column names of multiple CSV files in R

Answers (2)

graph

Note

Related Questions