Reputation: 17
Original dataset is pulled from https://covid.cdc.gov/covid-data-tracker/#vaccinations_vacc-total-admin-rate-total
I am down to the summary of the variables of interest. My desire here is to change the column's that contain 'PCT' to NA, as a sum of percent's is not meaningful.
# summarise_all(CDC_vacc_State, ~if(is.numeric(.)) sum(., na.rm = T) else "National")
Here I am going to output the dataset. Here we can see that there are 8 columns that contain 'PCT'. I do understand that my national population is off by 1.2M, I back-solved it from the data rather than merging in Census values.
# dput(summarise_all(CDC_vacc_State, ~if(is.numeric(.)) sum(., na.rm = T) else "National"))
df <- structure(
list(
STATE = "National",
ONE_DOSE_NUM = 220978782,
ONE_DOSE_PCT = 3752.9,
FULL_DOSE_NUM = 190719566,
FULL_DOSE_PCT = 3276.7,
POP = 334203604,
ONE_DOSE_NUM_18Plus = 206114881,
ONE_DOSE_PCT_18Plus = 4412.7,
FULL_DOSE_NUM_18Plus = 178587829,
FULL_DOSE_PCT_18Plus = 3883.9,
POP_18Plus = 258316189,
ONE_DOSE_NUM_65Plus = 53312849,
ONE_DOSE_PCT_65Plus = 5112.9,
FULL_DOSE_NUM_65Plus = 47324330,
FULL_DOSE_PCT_65Plus = 4652.6,
POP_65Plus = 55128330,
ONE_DOSE_NUM_12Plus = 219898471,
ONE_DOSE_PCT_12Plus = 4258.4,
FULL_DOSE_NUM_12Plus = 189835922,
FULL_DOSE_PCT_12Plus = 3733.8,
POP_12Plus = 282154872
),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA,-1L)
)
I have figured out a solution, but I was hoping to do this inline using dplyr and use a bind_rows() or rbindlist() with the original dataset.
# this is my current solution
df[,names(select(df, contains('PCT')))] <- NA
Final dataset looks like
dput(df)
structure(list(STATE = "National", ONE_DOSE_NUM = 220978782,
ONE_DOSE_PCT = NA, FULL_DOSE_NUM = 190719566, FULL_DOSE_PCT = NA,
POP = 334203604, ONE_DOSE_NUM_18Plus = 206114881, ONE_DOSE_PCT_18Plus = NA,
FULL_DOSE_NUM_18Plus = 178587829, FULL_DOSE_PCT_18Plus = NA,
POP_18Plus = 258316189, ONE_DOSE_NUM_65Plus = 53312849, ONE_DOSE_PCT_65Plus = NA,
FULL_DOSE_NUM_65Plus = 47324330, FULL_DOSE_PCT_65Plus = NA,
POP_65Plus = 55128330, ONE_DOSE_NUM_12Plus = 219898471, ONE_DOSE_PCT_12Plus = NA,
FULL_DOSE_NUM_12Plus = 189835922, FULL_DOSE_PCT_12Plus = NA,
POP_12Plus = 282154872), row.names = c(NA, -1L), class = c("tbl_df",
"tbl", "data.frame"))
Upvotes: 0
Views: 106
Reputation: 66415
dplyr equivalent of your df[,names(select(df, contains('PCT')))] <- NA
:
df %>%
mutate(across(contains("PCT"), ~NA))
Alternatively, you could deal with this "down-stream" by summarizing only the non-PCT columns. If you bind_rows
to your original data, I believe the PCT columns would NA since they wouldn't be in the result here:
df %>%
summarize(across(where(is.numeric) & !contains("PCT"),
~sum(.x, na.rm = T)))
Upvotes: 3