Reputation: 1089
I have a dataframe (simple example below) in which each user fills out a questionnaire a different number of times, and each completion of a questionnaire results in a row in the dataframe. In my simple example, Users A, C and D have entries on 5 days, but user B has entries on only 4:
> df
UserId Days_From_First_Use Q1 Q2 Q3
1 A 0 3 2 1
2 A 1 1 0 0
3 A 2 1 1 0
4 A 3 0 2 0
5 A 4 1 1 1
6 B 0 4 8 2
7 B 2 2 2 1
8 B 4 5 6 5
9 B 5 4 5 5
10 C 0 5 7 2
11 C 1 2 2 2
12 C 2 5 5 4
13 C 3 6 5 3
14 C 4 6 6 4
15 D 0 5 3 5
16 D 1 5 3 4
17 D 2 4 2 6
18 D 3 0 0 1
19 D 4 1 1 1
I now compute the time-series volatilities for each user as follows:
> df <- df %>%
+   group_by(UserId) %>%
+   mutate(across(all_of(c("Q1", "Q2", "Q3")), sd, .names = "Sigma_{.col}")) %>%
+   ungroup()
> df
# A tibble: 19 x 8
UserId Days_From_First_Use Q1 Q2 Q3 Sigma_Q1 Sigma_Q2 Sigma_Q3
<fct> <int> <int> <int> <int> <dbl> <dbl> <dbl>
1 A 0 3 2 1 1.10 0.837 0.548
2 A 1 1 0 0 1.10 0.837 0.548
3 A 2 1 1 0 1.10 0.837 0.548
4 A 3 0 2 0 1.10 0.837 0.548
5 A 4 1 1 1 1.10 0.837 0.548
6 B 0 4 8 2 1.26 2.5 2.06
7 B 2 2 2 1 1.26 2.5 2.06
8 B 4 5 6 5 1.26 2.5 2.06
9 B 5 4 5 5 1.26 2.5 2.06
10 C 0 5 7 2 1.64 1.87 1
11 C 1 2 2 2 1.64 1.87 1
12 C 2 5 5 4 1.64 1.87 1
13 C 3 6 5 3 1.64 1.87 1
14 C 4 6 6 4 1.64 1.87 1
15 D 0 5 3 5 2.35 1.30 2.30
16 D 1 5 3 4 2.35 1.30 2.30
17 D 2 4 2 6 2.35 1.30 2.30
18 D 3 0 0 1 2.35 1.30 2.30
19 D 4 1 1 1 2.35 1.30 2.30
Now comes the part that's giving me trouble: I would like to compute the median sd across users, so that I can identify two subsets of users with above-median and below-median sd. I can't just take the median over all rows, because users contribute different numbers of rows, which would weight them unequally. I can, however, group by Days_From_First_Use and compute the median sd on each day. Since every user has a Day 0 (their very first day), the median sd on that day is the value I want. So I type:
> df <- df %>%
+   group_by(UserId) %>%
+   mutate(across(all_of(c("Q1", "Q2", "Q3")), sd, .names = "Sigma_{.col}")) %>%
+   ungroup() %>%
+   group_by(Days_From_First_Use) %>%
+   mutate(across(all_of(paste0("Sigma_", c("Q1", "Q2", "Q3"))), median, .names = "Median_{.col}")) %>%
+   ungroup()
> df
# A tibble: 19 x 11
UserId Days_From_First_Use Q1 Q2 Q3 Sigma_Q1 Sigma_Q2 Sigma_Q3 Median_Sigma_Q1 Median_Sigma_Q2 Median_Sigma_Q3
<fct> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 0 3 2 1 1.10 0.837 0.548 1.45 1.59 1.53
2 A 1 1 0 0 1.10 0.837 0.548 1.64 1.30 1
3 A 2 1 1 0 1.10 0.837 0.548 1.45 1.59 1.53
4 A 3 0 2 0 1.10 0.837 0.548 1.64 1.30 1
5 A 4 1 1 1 1.10 0.837 0.548 1.45 1.59 1.53
6 B 0 4 8 2 1.26 2.5 2.06 1.45 1.59 1.53
7 B 2 2 2 1 1.26 2.5 2.06 1.45 1.59 1.53
8 B 4 5 6 5 1.26 2.5 2.06 1.45 1.59 1.53
9 B 5 4 5 5 1.26 2.5 2.06 1.26 2.5 2.06
10 C 0 5 7 2 1.64 1.87 1 1.45 1.59 1.53
11 C 1 2 2 2 1.64 1.87 1 1.64 1.30 1
12 C 2 5 5 4 1.64 1.87 1 1.45 1.59 1.53
13 C 3 6 5 3 1.64 1.87 1 1.64 1.30 1
14 C 4 6 6 4 1.64 1.87 1 1.45 1.59 1.53
15 D 0 5 3 5 2.35 1.30 2.30 1.45 1.59 1.53
16 D 1 5 3 4 2.35 1.30 2.30 1.64 1.30 1
17 D 2 4 2 6 2.35 1.30 2.30 1.45 1.59 1.53
18 D 3 0 0 1 2.35 1.30 2.30 1.64 1.30 1
19 D 4 1 1 1 2.35 1.30 2.30 1.45 1.59 1.53
By way of reference, the (incorrect) medians over the entire dataframe are 1.64, 1.30 and 1 respectively, while the correct medians are 1.45, 1.59 and 1.53.
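To see why the Day 0 medians are the right ones: every user appears exactly once on Day 0, so the median taken there is a median over users, each counted once. A quick check with the rounded Sigma values above:
> median(c(1.10, 1.26, 1.64, 2.35))   # Q1: 1.45
> median(c(0.837, 2.5, 1.87, 1.30))   # Q2: 1.585, displayed as 1.59
> median(c(0.548, 2.06, 1, 2.30))     # Q3: 1.53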
I now want to replace the per-day medians in all rows with the Day 0 medians. Once I do this, I can correctly split the dataframe into high sd and low sd subsets.
Question: How can I copy the correct Day 0 medians throughout these three columns, and then create new columns that flag each user as Low sd or High sd relative to the median volatility for each question?
Sincerely and with many thanks in advance
Thomas Philips
Upvotes: 0
Views: 297
Reputation: 1328
I think you could calculate the correct medians using only the first record for each user, and then left_join the result back onto the original dataframe.
df <-
  tibble(
    UserId = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C", "C", "D", "D", "D", "D", "D"),
    DFFU = c(0, 1, 2, 3, 4, 0, 2, 4, 5, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4),
    Q1 = c(3, 1, 1, 0, 1, 4, 2, 5, 4, 5, 2, 5, 6, 6, 5, 5, 4, 0, 1),
    Q2 = c(2, 0, 1, 2, 1, 8, 2, 6, 5, 7, 2, 5, 5, 6, 3, 3, 2, 0, 1),
    Q3 = c(1, 0, 0, 0, 1, 2, 1, 5, 5, 2, 2, 4, 3, 4, 5, 4, 6, 1, 1)
  )
df <- df %>%
  group_by(UserId) %>%
  mutate(across(all_of(c("Q1", "Q2", "Q3")), sd, .names = "Sigma_{.col}")) %>%
  ungroup()
df <- df %>%
  filter(DFFU == 0) %>%
  transmute(UserId, across(all_of(paste0("Sigma_", c("Q1", "Q2", "Q3"))), median, .names = "Median_{.col}")) %>%
  {left_join(df, .)}
Yielding (left_join matches on the shared UserId column):
> df
# A tibble: 19 x 11
UserId DFFU Q1 Q2 Q3 Sigma_Q1 Sigma_Q2 Sigma_Q3 Median_Sigma_Q1 Median_Sigma_Q2 Median_Sigma_Q3
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 0 3 2 1 1.10 0.837 0.548 1.45 1.59 1.53
2 A 1 1 0 0 1.10 0.837 0.548 1.45 1.59 1.53
3 A 2 1 1 0 1.10 0.837 0.548 1.45 1.59 1.53
4 A 3 0 2 0 1.10 0.837 0.548 1.45 1.59 1.53
5 A 4 1 1 1 1.10 0.837 0.548 1.45 1.59 1.53
6 B 0 4 8 2 1.26 2.5 2.06 1.45 1.59 1.53
7 B 2 2 2 1 1.26 2.5 2.06 1.45 1.59 1.53
8 B 4 5 6 5 1.26 2.5 2.06 1.45 1.59 1.53
9 B 5 4 5 5 1.26 2.5 2.06 1.45 1.59 1.53
10 C 0 5 7 2 1.64 1.87 1 1.45 1.59 1.53
11 C 1 2 2 2 1.64 1.87 1 1.45 1.59 1.53
12 C 2 5 5 4 1.64 1.87 1 1.45 1.59 1.53
13 C 3 6 5 3 1.64 1.87 1 1.45 1.59 1.53
14 C 4 6 6 4 1.64 1.87 1 1.45 1.59 1.53
15 D 0 5 3 5 2.35 1.30 2.30 1.45 1.59 1.53
16 D 1 5 3 4 2.35 1.30 2.30 1.45 1.59 1.53
17 D 2 4 2 6 2.35 1.30 2.30 1.45 1.59 1.53
18 D 3 0 0 1 2.35 1.30 2.30 1.45 1.59 1.53
19 D 4 1 1 1 2.35 1.30 2.30 1.45 1.59 1.53
One of the reasons your analysis is getting so unwieldy, though, is that you are breaking tidy data principles. In your original data set each row represents one survey, but the standard deviation applies to each user, not to each survey. So the standard deviation values should appear in a table with four rows, one for each user. The median then describes the population of users, and since there is only one population, there should be only one row. Therefore, I'd recommend:
sd_df <-
  df %>%
  group_by(UserId) %>%
  summarize(
    across(
      all_of(c("Q1", "Q2", "Q3")),
      .fns = sd,
      .names = "Sigma_{.col}"
    ),
    n = n()
  )

median_sd_df <-
  sd_df %>%
  summarize(
    across(
      all_of(paste0("Sigma_", c("Q1", "Q2", "Q3"))),
      .fns = median,
      .names = "Median_{.col}"
    )
  )
which gives you:
> sd_df
# A tibble: 4 x 5
UserId Sigma_Q1 Sigma_Q2 Sigma_Q3 n
<chr> <dbl> <dbl> <dbl> <int>
1 A 1.10 0.837 0.548 5
2 B 1.26 2.5 2.06 4
3 C 1.64 1.87 1 5
4 D 2.35 1.30 2.30 5
> median_sd_df
# A tibble: 1 x 3
Median_Sigma_Q1 Median_Sigma_Q2 Median_Sigma_Q3
<dbl> <dbl> <dbl>
1 1.45 1.59 1.53
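From these two tables, the High/Low split asked about in the question falls out at the user level. A minimal sketch (the Vol_* column names and the >= tie rule are my own choices, not from the question):
user_groups <-
  sd_df %>%
  mutate(
    # label each user relative to the population median, one label per question
    Vol_Q1 = if_else(Sigma_Q1 >= median_sd_df$Median_Sigma_Q1, "High", "Low"),
    Vol_Q2 = if_else(Sigma_Q2 >= median_sd_df$Median_Sigma_Q2, "High", "Low"),
    Vol_Q3 = if_else(Sigma_Q3 >= median_sd_df$Median_Sigma_Q3, "High", "Low")
  )
If you need the labels back on the survey-level rows, left_join(df, user_groups) will attach them by UserId.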
Upvotes: 1
Reputation: 1089
Solved it, but in a clumsy way: arrange by Days_From_First_Use so that the Day 0 rows come first (every user has a Day 0 entry, so the Day 0 medians are the ones I want), extract the first row's medians into another dataframe, and then overwrite the relevant columns in df with it.
> df_median_sigma <- df %>%
+   arrange(Days_From_First_Use) %>%
+   select(starts_with("Median_Sigma")) %>%
+   filter(row_number() == 1)
> df_median_sigma
# A tibble: 1 x 3
Median_Sigma_Q1 Median_Sigma_Q2 Median_Sigma_Q3
<dbl> <dbl> <dbl>
1 1.45 1.59 1.53
Finally, overwrite all the relevant columns in df with the correct medians:
> df[paste0("Median_Sigma_", c("Q1", "Q2", "Q3"))] <- df_median_sigma
> df
# A tibble: 19 x 11
UserId Days_From_First_Use Q1 Q2 Q3 Sigma_Q1 Sigma_Q2 Sigma_Q3 Median_Sigma_Q1 Median_Sigma_Q2 Median_Sigma_Q3
<fct> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 0 3 2 1 1.10 0.837 0.548 1.45 1.59 1.53
2 A 1 1 0 0 1.10 0.837 0.548 1.45 1.59 1.53
3 A 2 1 1 0 1.10 0.837 0.548 1.45 1.59 1.53
4 A 3 0 2 0 1.10 0.837 0.548 1.45 1.59 1.53
5 A 4 1 1 1 1.10 0.837 0.548 1.45 1.59 1.53
6 B 0 4 8 2 1.26 2.5 2.06 1.45 1.59 1.53
7 B 2 2 2 1 1.26 2.5 2.06 1.45 1.59 1.53
8 B 4 5 6 5 1.26 2.5 2.06 1.45 1.59 1.53
9 B 5 4 5 5 1.26 2.5 2.06 1.45 1.59 1.53
10 C 0 5 7 2 1.64 1.87 1 1.45 1.59 1.53
11 C 1 2 2 2 1.64 1.87 1 1.45 1.59 1.53
12 C 2 5 5 4 1.64 1.87 1 1.45 1.59 1.53
13 C 3 6 5 3 1.64 1.87 1 1.45 1.59 1.53
14 C 4 6 6 4 1.64 1.87 1 1.45 1.59 1.53
15 D 0 5 3 5 2.35 1.30 2.30 1.45 1.59 1.53
16 D 1 5 3 4 2.35 1.30 2.30 1.45 1.59 1.53
17 D 2 4 2 6 2.35 1.30 2.30 1.45 1.59 1.53
18 D 3 0 0 1 2.35 1.30 2.30 1.45 1.59 1.53
19 D 4 1 1 1 2.35 1.30 2.30 1.45 1.59 1.53
It works, but it's a bit clumsy. I suspect dplyr has a more elegant way to do this, but I haven't been able to find it.
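A possibly more elegant route (a sketch, not tested against the full pipeline): after the second mutate, every Day 0 row already holds the desired median, so a single mutate(across()) can broadcast it to all rows:
> df <- df %>%
+   mutate(across(
+     starts_with("Median_Sigma"),
+     # every Day 0 row carries the Day 0 median; take one and recycle it
+     ~ .x[Days_From_First_Use == 0][1]
+   ))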
Upvotes: 0