ginger_cat
ginger_cat

Reputation: 105

How to create new columns based on values and names of existing columns in R?

I have a data set with a number of columns such as 1_2014_precip for January 2014 precipitation. This is a merge of two data sets. The first are agronomic variables, for example grain yield, which were collected in a given year. The second is weather data, which was downloaded for the entire period of time that any experiment was conducted. So I collected grain yield in one year, for example 2013, and I currently have columns for the weather that was experienced at that site in 2011-2019. I want to have a data set where I only have the data that corresponds to the year the yield was collected.

This data set spans all 12 months, 7 weather variables, and 10 years. I want to make columns such as 2_mean_temp, corresponding to February mean temperature, and use the column Year to tell R where to look for the correct data (so if the entry in Year is 2019, the script would pull the value for the new column 2_mean_temp from the existing column 2_2019_mean_temp. I'm including an example of the data from two weather variables in two months in two years to give an idea of how I need this to be manipulated. I have figured out how to do this in Python, but for workflow purposes, I need to be able to do it in R. My main issue in R is I don't know how to tell R to go to a different column based on the value of a given column- I can't generate even the first column without automation.

    dput(head(df))
       structure(list(Experiment = c("IREE- N Rate", "IREE- N Rate", 
       "IREE- N Rate", "IREE- N Rate", "IREE- N Rate", "IREE- N Rate"), 
        Site = c("Waseca", "Waseca", "Waseca", "Waseca", "Waseca", "Waseca"), 
        Year = c(2013L, 2013L, 2013L, 2013L, 2014L, 2014L),
        `1_2013_mean_temp` = c(-8.58677419354839, -8.58677419354839, -8.58677419354839, -8.58677419354839, -8.58677419354839, -8.58677419354839), 
        `1_2013_precip` = c(14.17, 14.17, 14.17, 14.17, 14.17, 14.17), 
        `1_2014_mean_temp` = c(-14.0787096774194, -14.0787096774194, -14.0787096774194, -14.0787096774194, -14.0787096774194, -14.0787096774194), 
        `1_2014_precip` = c(21.97, 21.97, 21.97, 21.97, 21.97, 21.97), 
        `2_2013_mean_temp` = c(-7.22428571428571, -7.22428571428571, -7.22428571428571, -7.22428571428571, -7.22428571428571, -7.22428571428571), 
        `2_2013_precip` = c(27.94, 27.94, 27.94, 27.94, 27.94, 27.94), 
        `2_2014_mean_temp` = c(-13.5003571428571, -13.5003571428571, -13.5003571428571, -13.5003571428571, -13.5003571428571, -13.5003571428571), 
        `2_2014_precip` = c(28.95, 28.95, 28.95, 28.95, 28.95, 28.95)), row.names = c(195L, 223L, 245L, 271L, 196L, 224L), class = "data.frame")

This is what I would like this sample of data to look like after it's been manipulated. Notice how there are no more years in the column names, and the data has been moved from the respective columns matching the appropriate year (2013_month_variable in the first four cases, 2014_month_variable in the last two cases).

df2 <- data.frame(Experiment = c("IREE- N Rate", "IREE- N Rate", "IREE- N Rate", "IREE- N Rate", "IREE- N Rate", "IREE- N Rate"),
                  Site = c("Waseca", "Waseca", "Waseca", "Waseca", "Waseca", "Waseca"),
                  Year = c(2013, 2013, 2013, 2013, 2014, 2014),
                  1_mean_temp = c(-8.585774, -8.585774, -8.585774, -8.585774, -14.07871, -14.07871),
                  1_precip = c(14.17, 14.17, 14.17, 14.17, 21.97, 21.97),
                  2_mean_temp = c(-7.224286,-7.224286, -7.224286, -7.224286, -13.50036, -13.50036),
                  2_precip = c(27.94, 27.94, 27.94, 27.94, 28.95, 28.95))

Here's how I did it in Python.

    months = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12']
    variables = ['mean_temp', 'mean_max_temp', 'mean_min_temp', 'min_min_temp', 'mean_rh', 'precip', 'VPD']
    years = [2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]
    
   for m in months:
           for var in variables:
               for year in years:
                    try:
                        df.loc[(df['Year'] == year), '_' + m + '_' + var] = df[m + '_' + year + '_' + var]
                    except:
                        print(year + '_' + m + '_' + var)

I can recreate the column names using a for loop in R, and after that is where I get stuck. I've tried doing it without automation, but I can't seem to find a way to get R to make which column R queries for a value based on the value of a different column.

years <- list("2013", "2014")
months <- list("1", "2")
vars <- list("mean_temp", "precip")
for (year in years) (
  for (month in months) (
    for (var in vars) {
      x = paste(year,"_", month,"_",var, sep="") 
    }
  )
)

Reshaping by year won't solve the problem, because it would create nonsensical columns like "2013_2011_1_mean_temp" and would not automate the connection of the weather data to the appropriate year in which the grain was collected.

Upvotes: 4

Views: 660

Answers (5)

da11an
da11an

Reputation: 731

Method 1: This should work pretty generally, [edit: but not if column types differ]:

library(dplyr)
library(tibble)
library(tidyr)

df_left <- df %>% select(Experiment:Year)
df_right <- df %>% select(-Experiment:-Year)

df_right_new <- 
  df_right %>%
  pivot_longer(everything()) %>%
  separate(name, into = c("Month", "Year", "Property"), extra = 'merge') %>%
  unite("Month_Property", c(Month, Property) ) %>%
  pivot_wider(names_from = Month_Property, values_from = value, values_fn = unique) %>%
  mutate(Year = as.integer(Year))

df_new <- left_join(df_left, df_right_new)

Output of df_new

    Experiment   Site Year 1_mean_temp 1_precip 2_mean_temp 2_precip
1 IREE- N Rate Waseca 2013   -8.586774    14.17   -7.224286    27.94
2 IREE- N Rate Waseca 2013   -8.586774    14.17   -7.224286    27.94
3 IREE- N Rate Waseca 2013   -8.586774    14.17   -7.224286    27.94
4 IREE- N Rate Waseca 2013   -8.586774    14.17   -7.224286    27.94
5 IREE- N Rate Waseca 2014  -14.078710    21.97  -13.500357    28.95
6 IREE- N Rate Waseca 2014  -14.078710    21.97  -13.500357    28.95

Method 2 (improved): Works across mixed data types, faster than first solution:

library(dplyr)
library(tibble)
library(tidyr)
library(purrr)

df_left <- df %>% select(Experiment:Year)
df_right <- df %>% select(-Experiment:-Year) %>% distinct()
year_name <- names(df_right) %>%
  strsplit(fixed = TRUE, split = "_") %>%
  sapply(function(i) c(i[2], paste(i[-2], collapse = "_")))

df_new <- lapply(seq(unique(year_name[2,])), function(i) {
  name_match <- which(year_name[2,] == unique(year_name[2, i]))
  tibble(
    Year = as.integer(year_name[1, name_match]),
    Value = unlist(df_right[name_match])
  ) %>% rename(!!unique(year_name[2,])[[i]] := Value)
}) %>% 
  append(list(as_tibble(df_left)), 0) %>%
  purrr::reduce(left_join, by = "Year")

Upvotes: 1

andrew_reece
andrew_reece

Reputation: 21264

You can accomplish your goal without any pivoting or explicit loops, by matching column names against the value of Year in each row, using rowwise and map.

This creates a list column, which you can unnest to produce the correct values for each year's data.

library(tidyverse)

cols <- colnames(df)
df <- tibble(df)

df %>%
  mutate(Year = as.character(Year)) %>%
  rowwise() %>%
  mutate(
    tmp_df = list(
      df %>% select(one_of(cols[map_lgl(cols, ~str_detect(., c_across(Year)))])) %>% 
        rename_with(~str_replace(., "_\\d{4}", "")) %>%
        slice(1)
      )
    ) %>%
  select(Experiment, Site, Year, tmp_df) %>%
  unnest_wider(tmp_df)

Output:

# A tibble: 6 x 7
  Experiment   Site   Year  `1_mean_temp` `1_precip` `2_mean_temp` `2_precip`
  <chr>        <chr>  <chr>         <dbl>      <dbl>         <dbl>      <dbl>
1 IREE- N Rate Waseca 2013          -8.59       14.2         -7.22       27.9
2 IREE- N Rate Waseca 2013          -8.59       14.2         -7.22       27.9
3 IREE- N Rate Waseca 2013          -8.59       14.2         -7.22       27.9
4 IREE- N Rate Waseca 2013          -8.59       14.2         -7.22       27.9
5 IREE- N Rate Waseca 2014         -14.1        22.0        -13.5        29.0
6 IREE- N Rate Waseca 2014         -14.1        22.0        -13.5        29.0

Upvotes: 2

s_baldur
s_baldur

Reputation: 33498

Using data.table and abusing .SD a bit makes for a concise solution:

library(data.table)
setDT(df)
idvars <- c("Experiment", "Site", "Year") 
df2 <- df[, .SD[, .SD, .SDcols = names(.SD) %flike% last(.BY)], by = idvars]
setnames(df2, sub("\\d{4}_", "", names(df2)))


#      Experiment   Site Year 1_mean_temp 1_precip 2_mean_temp 2_precip
# 1: IREE- N Rate Waseca 2013   -8.586774    14.17   -7.224286    27.94
# 2: IREE- N Rate Waseca 2013   -8.586774    14.17   -7.224286    27.94
# 3: IREE- N Rate Waseca 2013   -8.586774    14.17   -7.224286    27.94
# 4: IREE- N Rate Waseca 2013   -8.586774    14.17   -7.224286    27.94
# 5: IREE- N Rate Waseca 2014  -14.078710    21.97  -13.500357    28.95
# 6: IREE- N Rate Waseca 2014  -14.078710    21.97  -13.500357    28.95

Using only base R:

dfspl <- split(df, df$Year)
for (i in seq_along(dfspl)) {
  df2 <- dfspl[[i]][!(names(df) %in% idvars | grepl(names(dfspl)[i], names(df)))] <-
    NULL
  names(dfspl[[i]]) <- sub("\\d{4}_", "", names(dfspl[[i]]))
}
do.call(rbind, dfspl)

#            Experiment   Site Year 1_mean_temp 1_precip 2_mean_temp 2_precip
# 2013.195 IREE- N Rate Waseca 2013   -8.586774    14.17   -7.224286    27.94
# 2013.223 IREE- N Rate Waseca 2013   -8.586774    14.17   -7.224286    27.94
# 2013.245 IREE- N Rate Waseca 2013   -8.586774    14.17   -7.224286    27.94
# 2013.271 IREE- N Rate Waseca 2013   -8.586774    14.17   -7.224286    27.94
# 2014.196 IREE- N Rate Waseca 2014  -14.078710    21.97  -13.500357    28.95
# 2014.224 IREE- N Rate Waseca 2014  -14.078710    21.97  -13.500357    28.95

Upvotes: 2

lroha
lroha

Reputation: 34291

This can be done by a round-trip pivot from long to wide and back again and filtering the data inbetween:

library(dplyr)
library(tidyr)
library(tibble)

df %>%
  rowid_to_column() %>%
  pivot_longer(-c(Experiment, Site, Year, rowid), names_pattern = "(\\d+)_(\\d{4})_(.*)", names_to = c("obs", "yr", "meas")) %>%
  filter(Year == yr) %>%
  select(-yr) %>%
  pivot_wider(names_from = c(obs, meas), values_from = value, names_prefix = "X")

# A tibble: 6 x 8
  rowid Experiment   Site    Year X1_mean_temp X1_precip X2_mean_temp X2_precip
  <int> <chr>        <chr>  <int>        <dbl>     <dbl>        <dbl>     <dbl>
1     1 IREE- N Rate Waseca  2013        -8.59      14.2        -7.22      27.9
2     2 IREE- N Rate Waseca  2013        -8.59      14.2        -7.22      27.9
3     3 IREE- N Rate Waseca  2013        -8.59      14.2        -7.22      27.9
4     4 IREE- N Rate Waseca  2013        -8.59      14.2        -7.22      27.9
5     5 IREE- N Rate Waseca  2014       -14.1       22.0       -13.5       29.0
6     6 IREE- N Rate Waseca  2014       -14.1       22.0       -13.5       29.0

Upvotes: 1

Kevin A
Kevin A

Reputation: 169

I believe this is the result you're looking for; this solution uses tidyverse functions.

library(tidyverse)

# create empty tibble to store results
df.out <- tibble()

# loop over years
for (i in unique(df$Year)) {

  this.year <- df %>%
    # grab number of rows for this year
    filter(Year == i) %>%
    # only grab the weather columns for this specific year
    select(Experiment, Site, Year, contains(paste0("_", i, "_"))) %>%
    # this function uses a regex to rename columns with "_" and the current year by removing the part of the name with "_" then 4 numbers
    rename_with(function(x) {str_replace(x, "\\_[0-9][0-9][0-9][0-9]", "")}, contains(paste0("_", i, "_")))
  
  # add this year to your output tibble
  df.out <- df.out %>%
    bind_rows(this.year)
  
}

final data frame

Upvotes: 2

Related Questions