Reputation: 885
I have the following data set:
id pnum t1 t2 t3 w1 w2 w3
1 1 w r r 1 1 1
1 2 o o w 0 0 1
1 3 o w w 1 1 1
2 1 o w t 1 0 1
2 2 s s s 1 0 1
2 3 s s s 1 0 1
Id
defines the group membership.
Based on id
and pnum I would like to identify the common
measurements reported with pnum 3
at time t
with respect to w
.
In other words id
and pnum
define different individuals who took some measurements. In some cases the measurement was taken together in other cases the measurement was taken alone. If taken together than at 'w'
we have value 1
.
For example:
Common activities at time t:
Id 1
pnum 1
at t1
reported (eg. 1
) that a measurement was taken with from the group, more specifically with id1/pnum3
. If the measurement data taken together is common in both groups I would like to save it.
Uncommon activities at time t:
Id 2
pnum 1
at t1 reported (eg. 1) that a measurement was taken with from the group, more specifically with id2/pnum2
and pnum 3.
In this case the measurement data taken together is uncommon between id2/pnum1
and pnum 2
as well between id2/pnum1
and pnum 3
. I don't want to save these measurements. But, I would like to save the common one reported between id2/pnum2
and pnum 3.
Generic example id 1:
In the group with id 1 pnum1 and pnum3 at t1 took together a measurement. Pnum 1
reported w
, pnum 2
and pnum 3
reported o
. This means that pnum 2
and pnum 3
reported the same measurement. However, when I look at w1 I could observe that they were not together when they did as at w1
pnum 2
is 0
and pnum 3
is 1
. In other words, even though the measurement are common for pnum 2
and pnum 3
as they were not taken together I don't want to save the case. I would need to report if they reported the same measurements or not. In this case pnum1 reported w
while pnum 3
reported o
, so the measurements don't match. Therefore I coded 0
. I don't want to save the case.
I would like to identify the common measurements that were taken together at time t.
Output:
id pnum t1 t2 t3
1 1 0 0 0
1 2 0 0 w
1 3 0 s w
2 1 0 0 0
2 2 s 0 s
2 3 s 0 s
Sample data:
df<-structure(list(id=c(1,1,1,2,2,2),pnum=c(1,2,3,1,2,3), t1=c("w","o","o","o","s","s"), t2=c("r","o","w","w","s","s"),t3 = c("r","w","w","t","s","s"), w1= c(1,0,1,1,1,1), w2 = c(1,0,1,0,0,0), w3 = c(1,1,1,1,1,1)), row.names = c(NA, 6L), class = "data.frame")
Upvotes: 1
Views: 76
Reputation: 1713
I don't get why in your expected output there is an "s" at [id1, pnum3, t2]
- other than that, I think the following might help you:
First of all, pivoting your data to a "longer" format, where you can group by time, helps you to generalize your code.
library(dplyr)
library(tidyr)
df_longer <- df %>%
pivot_longer(
cols = matches("^[tw]\\d+$"),
names_to = c(".value","time"),
names_pattern = "([tw])(\\d+)"
)
The above pivots your data to look like this:
> head(df_longer)
# A tibble: 6 x 5
id pnum time t w
<dbl> <dbl> <chr> <chr> <dbl>
1 1 1 1 w 1
2 1 1 2 r 1
3 1 1 3 r 1
4 1 2 1 o 0
5 1 2 2 o 0
6 1 2 3 w 1
Now, you can easily group it up and identify those individuals that have given common answers at any given time:
common_answers <- df_longer %>%
arrange(id, time, pnum) %>%
filter(w == 1) %>% # throw out if the answer was given individually
select(-w) %>% # w not needed anymore
group_by(id, time, t) %>% # group by selected answer
filter(n() > 1) %>% # keep only answers given >1 times
ungroup()
This presents you with only only a filtered set of your data where answers were given commonly in group:
> common_answers
# A tibble: 6 x 4
id pnum time t
<dbl> <dbl> <chr> <chr>
1 1 2 3 w
2 1 3 3 w
3 2 2 1 s
4 2 3 1 s
5 2 2 3 s
6 2 3 3 s
// ADDITION:
In case you have to rely on the "wide" format in your output, you can obviously retain all data, modify t
so that it only retains its value when it is given commonly by >1 subject and then widen your df again:
common_answers_wide <- df_longer %>%
group_by(id, time, w, t) %>%
mutate(
# retain t only when the response has been given by >1 subject
t = case_when(
w == 0 ~ "0",
n() > 1 ~ t,
T ~ "0"
)
) %>%
ungroup() %>%
select(-w) %>%
pivot_wider(
names_from = time, names_prefix = "t", names_sort = T,
values_from = t
)
That gives you exactly the desired output:
> common_answers_wide
# A tibble: 6 x 5
id pnum t1 t2 t3
<dbl> <dbl> <chr> <chr> <chr>
1 1 1 0 0 0
2 1 2 0 0 w
3 1 3 0 0 w
4 2 1 0 0 0
5 2 2 s 0 s
6 2 3 s 0 s
Upvotes: 1