Maridee Weber
Maridee Weber

Reputation: 231

How do I change the names of columns in multiple dataframes using a mapping file in R?

I have a script that loops through multiple years of data, one year at a time. Each year of data consists of multiple dataframes that are placed in a list called all_input. At the beginning of the loop (after the data is read in), I am trying to get all of the years of data in the same format before the rest of the processing.

The issue I am having is with column names, which are not uniform. There are 5 columns included in each dataframe that I want to keep, and I want them to be called total_emissions uom tribal_name st_usps_cd and description. In some dataframes they already have these names, while in others they have various names such as pollutant.desc or pollutant_desc, for example.

My current approach is this:

# Create a mapping file for the column names
    header_map <- data.frame(orignal_col = c( "pollutant_desc", "pollutant.desc", "emissions.uom", "total.emissions", "tribal.name", "state" ), 
                               new_col = c( "description", "description", "uom", "total_emissions", "tribal_name", "st_usps_cd" ), stringsAsFactors = FALSE)

    # change the column names
    lapply(all_input, function(x) {
      names(x)[match(header_map$orignal_col, names(x))] <- header_map$new_col
      x
    }) -> all_input

Which creates a header mapping file that looks like this:

original_col         new_col
pollutant_desc       description
pollutant.desc       description
emissions.uom        uom
total.emissions      total_emissions
tribal.name          tribal_name
state                st_usps_cd

The error I am getting is as follows:

Error in names(x)[match(header_map$orignal_col, names(x))] <- header_map$new_col : 
  NAs are not allowed in subscripted assignments

I understand that as I will have to manually add entries to the header file as new years of data with different column names are processed, but how can I get this to work?

Fake Sample Data. df1 and df2 represent the format of the "2017" data, where multiple columns need name changes, but the current names are consistent between dataframes. df3 represents "2011" data, where all of the column names are as they should be. df4 represents "2014" data, where the only column that needs to be changed is pollutant_desc. Note, there are extra columns in each dataframe that are not needed and can be ignored. And reminder, these dataframes are not all read at the same time. The loop is by year, so df1 and df2 (in list all_input) will be formatted and processed. Then all of the data is removed, and a new all_input list is created with the next years dataframes, which will have different column names. The code must work for all years without being changed.

> dput(df1)
structure(list(total.emissions = structure(1:2, .Label = c("100", 
"300"), class = "factor"), emissions.uom = structure(1:2, .Label = c("LB", 
"TON"), class = "factor"), international = c(TRUE, TRUE), hours = structure(2:1, .Label = c("17", 
"3"), class = "factor"), tribal.name = structure(2:1, .Label = c("FLLK", 
"SUWJG"), class = "factor"), state = structure(1:2, .Label = c("AK", 
"MN"), class = "factor"), pollutant.desc = structure(1:2, .Label = c("Methane", 
"NO2"), class = "factor"), policy = c(TRUE, FALSE)), class = "data.frame", row.names = c(NA, 
-2L))
> dput(df2)
structure(list(total.emissions = structure(2:1, .Label = c("20", 
"400"), class = "factor"), emissions.uom = structure(c(1L, 1L
), .Label = "TON", class = "factor"), international = c(FALSE, 
TRUE), hours = structure(2:1, .Label = c("1", "8"), class = "factor"), 
    tribal.name = structure(2:1, .Label = c("SOSD", "WMFJU"), class = "factor"), 
    state = structure(2:1, .Label = c("SD", "WY"), class = "factor"), 
    pollutant.desc = structure(1:2, .Label = c("CO2", "SO2"), class = "factor"), 
    policy = c(FALSE, FALSE)), class = "data.frame", row.names = c(NA, 
-2L))
> dput(df3)
structure(list(total_emissions = structure(2:1, .Label = c("200", 
"30"), class = "factor"), uom = structure(c(1L, 1L), .Label = "TON", class = "factor"), 
    boundaries = structure(2:1, .Label = c("N", "Y"), class = "factor"), 
    tribal_name = structure(2:1, .Label = c("SOSD", "WMFJU"), class = "factor"), 
    st_usps_cd = structure(2:1, .Label = c("ID", "KS"), class = "factor"), 
    description = structure(c(1L, 1L), .Label = "SO2", class = "factor"), 
    policy = c(FALSE, TRUE), time = structure(1:2, .Label = c("17", 
    "7"), class = "factor")), class = "data.frame", row.names = c(NA, 
-2L))
> dput(df4)
structure(list(total_emissions = structure(2:1, .Label = c("700", 
"75"), class = "factor"), uom = structure(c(1L, 1L), .Label = "LB", class = "factor"), 
    tribal_name = structure(1:2, .Label = c("SSJY", "WNCOPS"), class = "factor"), 
    st_usps_cd = structure(1:2, .Label = c("MO", "NY"), class = "factor"), 
    pollutant_desc = structure(2:1, .Label = c("CO2", "Methane"
    ), class = "factor"), boundaries = structure(c(1L, 1L), .Label = "N", class = "factor"), 
    policy = c(FALSE, FALSE), time = structure(1:2, .Label = c("2", 
    "3"), class = "factor")), class = "data.frame", row.names = c(NA, 
-2L))

Thank you!

Upvotes: 2

Views: 160

Answers (1)

r2evans
r2evans

Reputation: 160447

Try this:

list_of_frames1 <- list(df1, df2, df3, df4)
list_of_frames2 <- lapply(list_of_frames1, function(x) {
  nms <- intersect(names(x), header_map$orignal_col)
  names(x)[ match(nms, names(x)) ] <- header_map$new_col[ match(nms, header_map$orignal_col) ]
  x
})

Upvotes: 2

Related Questions