Reputation: 51
I have an issue where I am compiling datasets from multiple sources where they have the same kind of data I need, but the column names are different! So I am trying to figure out how to compile all of this data in a single function where I can select the column names I want based on a ranking system since the column names vary for what I want to be included.
While I'm not sure how to create a proper reproducible example for this specific issue, I came up with this dataset with three dataframes with similar variables but with different names. The goal is to create a ranking system for which variables are selected (or renamed).
As an example, if dataset1 has Temp1, Temp2, Temp3, and Temp4; and dataset2 has T_1_1_1, T_1_1_2, T_1_1_3, and T_1_1_4, I would like to select "Temp1" and "T_1_1_1" before the others. Assume other datasets could have a mix of these or other times of variables names (there will be multiple variables that start with "T").
I am compiling data from over 100 datasets so variability is a issue I am facing with this.
More specifically, I want to create some kind of ranking system to select column names like this based on my example below:
Select values closest to T_1_1_1 first out of variables with this format: T_1_1, T_1_2, T_1_3, T_1_4, T_1_5
Select values closest to Temp1 first out of variables with this format: Temp1, Temp2, Temp3, Temp4
Select values closest to P first out of variables with this format: P, P1, P2, P3, P4, P5, P6, P7
Select values closest to P_FI_1 first out of variables with this format: P_FI_1, P_FI_2, P_FI_3, P_FI_4, P_FI_5, P_FI_6, P_FI_7,
I think I need something from dplyr with rename_at() with any_of() (such as seen here: https://dplyr.tidyverse.org/reference/rename.html) but I'm not sure how to get this to work for my particular problem.
Any help would be greatly appreciated!
library(dplyr)
#Dataset 1
dataset1 <- as_tibble(iris) # so it prints a little nicer
#goal: have it pull Temp1, P3, S2_1 out of all possible options
dataset1$Temp1 <- sample(50, size = nrow(dataset1), replace = TRUE)
dataset1$Temp2 <- sample(50, size = nrow(dataset1), replace = TRUE)
dataset1$Temp3 <- sample(50, size = nrow(dataset1), replace = TRUE)
dataset1$Temp4 <- sample(50, size = nrow(dataset1), replace = TRUE)
dataset1$P6 <- sample(50, size = nrow(dataset1), replace = TRUE)
dataset1$P7 <- sample(50, size = nrow(dataset1), replace = TRUE)
dataset1$P3 <- sample(50, size = nrow(dataset1), replace = TRUE)
dataset1$P4 <- sample(50, size = nrow(dataset1), replace = TRUE)
#Dataset #2
dataset2 <- as_tibble(iris) # so it prints a little nicer
#goal: have it pull T_1_1, P_FI_3, S_1_6 out of all possible options
dataset2$T_1_1 <- sample(50, size = nrow(dataset2), replace = TRUE)
dataset2$T_1_2 <- sample(50, size = nrow(dataset2), replace = TRUE)
dataset2$T_1_3 <- sample(50, size = nrow(dataset2), replace = TRUE)
dataset2$T_1_4 <- sample(50, size = nrow(dataset2), replace = TRUE)
dataset2$P_FI_6 <- sample(50, size = nrow(dataset2), replace = TRUE)
dataset2$P_FI_7 <- sample(50, size = nrow(dataset2), replace = TRUE)
dataset2$P_FI_3 <- sample(50, size = nrow(dataset2), replace = TRUE)
dataset2$P_FI_4 <- sample(50, size = nrow(dataset2), replace = TRUE)
#Dataset #3
dataset3 <- as_tibble(iris)
#goal: have it pull T_1_2, P, S out of all possible options
dataset3$T_1_2 <- sample(50, size = nrow(dataset3), replace = TRUE)
dataset3$T_1_3 <- sample(50, size = nrow(dataset3), replace = TRUE)
dataset3$T_1_4 <- sample(50, size = nrow(dataset3), replace = TRUE)
dataset3$T_1_5 <- sample(50, size = nrow(dataset3), replace = TRUE)
dataset1$P <- sample(50, size = nrow(dataset1), replace = TRUE)
dataset1$P7 <- sample(50, size = nrow(dataset1), replace = TRUE)
dataset1$P6 <- sample(50, size = nrow(dataset1), replace = TRUE)
dataset1$P4 <- sample(50, size = nrow(dataset1), replace = TRUE)
#Goals - I want something like this for the end result. It can be rename or subset, whatever is easiest.
#I'm not sure how to create a ranking system for the column names to select from but I'm thinking something like this:
#Select values closest to T_1_1_1 first out of variables with this format
#T_1_1
#T_1_2
#T_1_3
#T_1_4
#T_1_5
#Select values closest to Temp1 first out of variables with this format
#Temp1
#Temp2
#Temp3
#Temp4
#Select values closest to P first out of variables with this format
#P
#P1
#P2
#P3
#P4
#P5
#P6
#P7
#Select values closest to P_FI_1 first out of variables with this format
#P_FI_1
#P_FI_2
#P_FI_3
#P_FI_4
#P_FI_5
#P_FI_6
#P_FI_7
#goal: have it pull Temp1, P3 out of all possible options
colnames(dataset1)
dataset1$Temperature <- dataset1$Temp1
dataset1$Precipitation <- dataset1$P3
colnames(dataset1)
#goal: have it pull T_1_1, P_FI_3 out of all possible options
colnames(dataset2)
dataset2$Temperature <- dataset2$T_1_1
dataset2$Precipitation <- dataset2$P_FI_3
colnames(dataset2)
#goal: have it pull T_1_2, P out of all possible options
colnames(dataset3)
dataset3$Temperature <- dataset3$T_1_2
dataset3$Precipitation <- dataset3$P
colnames(dataset3)
Upvotes: 0
Views: 68
Reputation: 66480
Here's a proof of concept using a few of the categories. My approach is to take the column names, apply some logic to categorize them, and then pick the alphabetic first one for each category.
Some of the categories have overlapping qualities, so for case_when
you'd want to start from the most specific, ie look for "P_" before you look for "P", since only the first one will get "P_1" while the second would include both "P1" and "P_1". The specifics are an implementation detail that you will be in a better position to figure out, but I hope this points you in a good direction.
In this case, I used some of your categories from the original draft of the question. I didn't pick out the "P" category where presumably you could use regex or other criteria to exclude Petal.Length even though it starts with P.
subset_df <- function(df) {
colnames_df = data.frame(colname = colnames(df)) %>%
mutate(category = case_when(
colname %>% str_starts("Temp") ~ "Temp",
colname %>% str_starts("S_") ~ "S_",
colname %>% str_starts("S") & colname %>% str_detect("_") ~ "S1_",
colname %>% str_starts("S") ~ "S",
TRUE ~ "other"
)) %>%
arrange(colname) %>%
slice(1, .by = category)
df %>% select(any_of(colnames_df$colname))
}
subset_df(dataset1)
# outputs columns "P" "S2_1" "Sepal.Length" "Temp1"
subset_df(dataset2)
# outputs columns "P_FI_3" "S_1_6" "Sepal.Length"
subset_df(dataset3)
# outputs columns "Petal.Length" "S" "S_2_1"
Upvotes: 1