Reputation: 21
I have a big dataset including the variable "ID", "Date", "AS", "Value", "Condition Day 1", "Condition Day 2", and "measurement".
We measured amino acids in different subjects (ID) on two different dates (Date) with two different conditions (Condition Day 1 and 2). There are two measurements on every date.
ID | Date | AS | Value | Condition Day 1 | Condition Day 2 | measurement |
---|---|---|---|---|---|---|
1 | 2015-07-12 | TRP | 23 | 1 | 2 | 1 |
1 | 2015-07-12 | TRP | 11 | 1 | 2 | 2 |
1 | 2015-07-12 | LEU | .. | 1 | 2 | |
1 | 2015-07-12 | LEU | 1 | 2 | ||
1 | 2015-07-12 | .. | 1 | 2 | ||
1 | 2015-08-23 | TRP | 2 | 1 | ||
1 | 2015-08-23 | TRP | 2 | 1 | ||
1 | 2015-08-23 | LEU | 2 | 1 | ||
1 | 2015-08-23 | LEU | 2 | 1 | ||
... | ... | ... | 2 | 1 | ||
2 | .. | .. | ... | . | ||
2 | .. | ... | ... | .. . |
Now I am trying to create a new date variable based on the old one. The first date should be assigned the value "1", the second date the value "2". The problem I have, is that the dates change from ID to ID, since every subject was tested on a different dates ranging from 2015 to 2018.
In the next step I want to create a new variable based on condition, and measurement ranging from 1 to 4.
"1" = measurement 1 + Condition 1
"2" = measurement 2 + Condition 1
"3" = measurement 1 + Condition 2
"4" = measurement 2 + Condition 2
In the end the table should be looking something like this:
ID | AS | Value | D_C |
---|---|---|---|
1 | TRP | 23 | 1 |
1 | TRP | 11 | 2 |
1 | TRP | ... | 3 |
1 | TRP | ... | 4 |
1 | LEU | .. | 1 |
1 | LEU | ... | 2 |
1 | LEU | ... | 3 |
1 | LEU | ... | 4 |
2 | ... | ... | ... |
2 | ... | ... | ... |
I am kind of stuck here, especially with the first question. Can anyone help? Thank you so much!
*** UPDATE - MRE *** here some example data
df = data.frame(ID=c(1, 1, 1, 1, 2, 2, 2, 2),
Date=c("2015-07-15", "2015-07-15", "2015-07-30", "2015-07-30", "2016-07-15", "2016-07-15", "2016-07-30", "2016-07-30"),
AS = c("TRP", "TRP", "TRP", "TRP", "ILE", "ILE", "ILE", "ILE"),
Concentration = c(3, 2, 1, 3, 3, 2, 1, 0),
Method_Day1 = c(1, 1, 2, 2, 2, 2, 1, 1),
Method_Day2 = c(2, 2, 1, 1, 1, 1, 2, 2),
Sample = c(1, 2, 1, 2, 1, 2, 1, 2))
Upvotes: 0
Views: 1031
Reputation: 1784
In order to encode the first and second date per ID, we can group by ID, and then compare each date against the groupwise max/min.
library(tidyverse)
df1 <- df %>%
group_by(ID) %>%
dplyr::mutate(
date_encoding = case_when(
Date == min(Date) ~ 1,
Date == max(Date) ~ 2,
TRUE~0))
df1
# A tibble: 8 x 8
# Groups: ID [2]
ID Date AS Concentration Method_Day1 Method_Day2 Sample date_encoding
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2015-07-15 TRP 3 1 2 1 1
2 1 2015-07-15 TRP 2 1 2 2 1
3 1 2015-07-30 TRP 1 2 1 1 2
4 1 2015-07-30 TRP 3 2 1 2 2
5 2 2016-07-15 ILE 3 2 1 1 1
6 2 2016-07-15 ILE 2 2 1 2 1
7 2 2016-07-30 ILE 1 1 2 1 2
8 2 2016-07-30 ILE 0 1 2 2 2
For the second part of your question, you can again use mutate
and case_when
based on the other column conditions, but the description you provide for Condition/Method Day 1/2 is not completely clear from the question.
EDIT: as this works only for the specific case of expecting exactly two Dates per ID, I'll also add the more general solution that will work for an arbitrary number of Dates per ID:
# sort and index Dates per ID
df %>%
dplyr::distinct(ID, Date) %>%
group_by(ID) %>%
arrange(Date, .by_group = TRUE) %>%
dplyr::mutate(encoding2 = row_number()) %>%
# then join them back to the original dataframe
right_join(df, by = c("ID", "Date"))
# A tibble: 8 x 8
# Groups: ID [2]
ID Date encoding2 AS Concentration Method_Day1 Method_Day2 Sample
<dbl> <chr> <int> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 2015-07-15 1 TRP 3 1 2 1
2 1 2015-07-15 1 TRP 2 1 2 2
3 1 2015-07-30 2 TRP 1 2 1 1
4 1 2015-07-30 2 TRP 3 2 1 2
5 2 2016-07-15 1 ILE 3 2 1 1
6 2 2016-07-15 1 ILE 2 2 1 2
7 2 2016-07-30 2 ILE 1 1 2 1
8 2 2016-07-30 2 ILE 0 1 2 2
Upvotes: 1