Reputation: 231
I have a script that loops through multiple years of data, one year at a time. Each year of data consists of multiple dataframes that are placed in a list called all_input
. At the beginning of the loop (after the data is read in), I am trying to get all of the years of data in the same format before the rest of the processing.
The issue I am having is with column names, which are not uniform.
There are 5 columns included in each dataframe that I want to keep, and I want them to be called total_emissions
uom
tribal_name
st_usps_cd
and description
. In some dataframes they already have these names, while in others they have various names such as pollutant.desc
or pollutant_desc
, for example.
My current approach is this:
# Create a mapping file for the column names
header_map <- data.frame(orignal_col = c( "pollutant_desc", "pollutant.desc", "emissions.uom", "total.emissions", "tribal.name", "state" ),
new_col = c( "description", "description", "uom", "total_emissions", "tribal_name", "st_usps_cd" ), stringsAsFactors = FALSE)
# change the column names
lapply(all_input, function(x) {
names(x)[match(header_map$orignal_col, names(x))] <- header_map$new_col
x
}) -> all_input
Which creates a header mapping file that looks like this:
original_col new_col
pollutant_desc description
pollutant.desc description
emissions.uom uom
total.emissions total_emissions
tribal.name tribal_name
state st_usps_cd
The error I am getting is as follows:
Error in names(x)[match(header_map$orignal_col, names(x))] <- header_map$new_col :
NAs are not allowed in subscripted assignments
I understand that as I will have to manually add entries to the header file as new years of data with different column names are processed, but how can I get this to work?
Fake Sample Data. df1 and df2 represent the format of the "2017" data, where multiple columns need name changes, but the current names are consistent between dataframes. df3 represents "2011" data, where all of the column names are as they should be. df4 represents "2014" data, where the only column that needs to be changed is pollutant_desc
. Note, there are extra columns in each dataframe that are not needed and can be ignored. And reminder, these dataframes are not all read at the same time. The loop is by year, so df1 and df2 (in list all_input
) will be formatted and processed. Then all of the data is removed, and a new all_input
list is created with the next years dataframes, which will have different column names. The code must work for all years without being changed.
> dput(df1)
structure(list(total.emissions = structure(1:2, .Label = c("100",
"300"), class = "factor"), emissions.uom = structure(1:2, .Label = c("LB",
"TON"), class = "factor"), international = c(TRUE, TRUE), hours = structure(2:1, .Label = c("17",
"3"), class = "factor"), tribal.name = structure(2:1, .Label = c("FLLK",
"SUWJG"), class = "factor"), state = structure(1:2, .Label = c("AK",
"MN"), class = "factor"), pollutant.desc = structure(1:2, .Label = c("Methane",
"NO2"), class = "factor"), policy = c(TRUE, FALSE)), class = "data.frame", row.names = c(NA,
-2L))
> dput(df2)
structure(list(total.emissions = structure(2:1, .Label = c("20",
"400"), class = "factor"), emissions.uom = structure(c(1L, 1L
), .Label = "TON", class = "factor"), international = c(FALSE,
TRUE), hours = structure(2:1, .Label = c("1", "8"), class = "factor"),
tribal.name = structure(2:1, .Label = c("SOSD", "WMFJU"), class = "factor"),
state = structure(2:1, .Label = c("SD", "WY"), class = "factor"),
pollutant.desc = structure(1:2, .Label = c("CO2", "SO2"), class = "factor"),
policy = c(FALSE, FALSE)), class = "data.frame", row.names = c(NA,
-2L))
> dput(df3)
structure(list(total_emissions = structure(2:1, .Label = c("200",
"30"), class = "factor"), uom = structure(c(1L, 1L), .Label = "TON", class = "factor"),
boundaries = structure(2:1, .Label = c("N", "Y"), class = "factor"),
tribal_name = structure(2:1, .Label = c("SOSD", "WMFJU"), class = "factor"),
st_usps_cd = structure(2:1, .Label = c("ID", "KS"), class = "factor"),
description = structure(c(1L, 1L), .Label = "SO2", class = "factor"),
policy = c(FALSE, TRUE), time = structure(1:2, .Label = c("17",
"7"), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
> dput(df4)
structure(list(total_emissions = structure(2:1, .Label = c("700",
"75"), class = "factor"), uom = structure(c(1L, 1L), .Label = "LB", class = "factor"),
tribal_name = structure(1:2, .Label = c("SSJY", "WNCOPS"), class = "factor"),
st_usps_cd = structure(1:2, .Label = c("MO", "NY"), class = "factor"),
pollutant_desc = structure(2:1, .Label = c("CO2", "Methane"
), class = "factor"), boundaries = structure(c(1L, 1L), .Label = "N", class = "factor"),
policy = c(FALSE, FALSE), time = structure(1:2, .Label = c("2",
"3"), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
Thank you!
Upvotes: 2
Views: 160
Reputation: 160447
Try this:
list_of_frames1 <- list(df1, df2, df3, df4)
list_of_frames2 <- lapply(list_of_frames1, function(x) {
nms <- intersect(names(x), header_map$orignal_col)
names(x)[ match(nms, names(x)) ] <- header_map$new_col[ match(nms, header_map$orignal_col) ]
x
})
Upvotes: 2