Reputation: 45
I'm struggling with the way to find the value of a column across other columns of a data.frame. I'd be thankful if anyone could help me. These are a simplified form of my data:
library(data.table)
df<-data.table(personid<-c(101, 102, 103, 104, 105, 201, 202, 203, 301, 302, 401),
hh_id<-c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4),
fatherid<-c(NA, NA, 101, 101, 101, NA, NA, 201, NA, NA, NA),
fatherid_1<-c(NA,101, 101, 101, NA, NA, 201, NA, NA, NA, NA),
fatherid_2<-c(101, 101, 101, NA, NA, 201, NA, NA, NA, NA, NA),
fatherid_3<-c(101, 101, NA, NA, NA, NA, NA, NA, NA, NA, NA),
fatherid_4<-c(101, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
fatherid_5<-c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA))
(the real one is 185000 rows and has up to 17 variables such as fatherid_1
, fatherid_2
... fatherid_17
)
What I'm trying to do is to create a variable that checks whether the value of variable personid
of a given row is the same as any of the values of variables fatherid_1
to fatherid_5
in the same row. For the given data, the outcome should be:
df$result <- c(1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0)
but I need something to do it automatically, over 17 columns such as fatherid_1
, and a lot of rows
In case you want to understand the sense of my calculations, I'm trying to build household grids without using only information in the same row
Thank you very much in advance!
Upvotes: 2
Views: 1005
Reputation: 9858
Two tidyverse solutions:
1-) You can use dplyr's new if_any()
, .==
and tidyr's replace_na()
. if_any()
obviates the need for either rowwise()
or reduce()
/ Reduce()
:
library(dplyr)
library(tidyr)
df%>%mutate(result=replace_na(if_any(matches('fatherid'), ~.==personid), 0))
2-) In a rowwise()
operation, you can apply a function to check the condition for all the selected columns with map()
, c_across()
and %in%
, which generates a logical vector. This can then be collapsed/reduce()
d in the same call.
library(purrr)
library(dplyr)
df%>%rowwise()%>%mutate(result=as.integer(reduce(map(c_across(fatherid_1:fatherid_5), ~. %in% personid), `|`)))
Or with pipes, for clarity:
#option 1
df%>%rowwise()%>%
mutate(result=map(c_across(fatherid_1:fatherid_5), ~. %in% personid)%>%
reduce(`|`)%>%
as.integer())
#option 2
df%>%rowwise()%>%
mutate(result=map_int(c_across(fatherid_1:fatherid_5), ~. %in% personid)%>%
reduce(max))
personid hh_id fatherid fatherid_1 fatherid_2 fatherid_3 fatherid_4 fatherid_5 result
1: 101 1 NA NA 101 101 101 NA 1
2: 102 1 NA 101 101 101 NA NA 0
3: 103 1 101 101 101 NA NA NA 0
4: 104 1 101 101 NA NA NA NA 0
5: 105 1 101 NA NA NA NA NA 0
6: 201 2 NA NA 201 NA NA NA 1
7: 202 2 NA 201 NA NA NA NA 0
8: 203 2 201 NA NA NA NA NA 0
9: 301 3 NA NA NA NA NA NA 0
10: 302 3 NA NA NA NA NA NA 0
11: 401 4 NA NA NA NA NA NA 0
Upvotes: 3
Reputation: 39657
A way to solve it in base using ==
to make the comparison and test if the rowSums
>0
:
+(rowSums(df[[1]] == df[,3:8], na.rm=TRUE) > 0)
# [1] 1 0 0 0 0 1 0 0 0 0 0
Or using any
with apply
.
+apply(df[[1]] == df[,3:8], 1, any, na.rm = TRUE)
# [1] 1 0 0 0 0 1 0 0 0 0 0
Or the same but using pipes:
(df[[1]] == df[,3:8]) |> rowSums(na.rm=TRUE) |> (`>`)(0) |> as.integer()
(df[[1]] == df[,3:8]) |> apply(1, any, na.rm=TRUE) |> as.integer()
Upvotes: 2
Reputation: 26218
If you don't want to use rowwise
then this can also be an alternative
library(dplyr)
df %>% group_by(personid) %>%
mutate(res = sum(cur_group() %in% cur_data()))
# A tibble: 11 x 9
# Groups: personid [11]
personid hh_id fatherid fatherid_1 fatherid_2 fatherid_3 fatherid_4 fatherid_5 res
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <int>
1 101 1 NA NA 101 101 101 NA 1
2 102 1 NA 101 101 101 NA NA 0
3 103 1 101 101 101 NA NA NA 0
4 104 1 101 101 NA NA NA NA 0
5 105 1 101 NA NA NA NA NA 0
6 201 2 NA NA 201 NA NA NA 1
7 202 2 NA 201 NA NA NA NA 0
8 203 2 201 NA NA NA NA NA 0
9 301 3 NA NA NA NA NA NA 0
10 302 3 NA NA NA NA NA NA 0
11 401 4 NA NA NA NA NA NA 0
Created on 2021-06-09 by the reprex package (v2.0.0)
If you want to be safe excluding hh_id
you could use
df %>% group_by(personid) %>%
mutate(res = sum(cur_group() %in% cur_data()[-1]))
Upvotes: 2
Reputation: 21908
We can also use the following solution with pmap
from purrr
package:
library(dplyr)
library(purrr)
df %>%
mutate(result = pmap_dbl(., ~ {x <- c(...)[-c(1, 2)];
if_else(all(x[!is.na(x)] != c(...)[1]) | all(is.na(x)), 0, 1)}))
personid hh_id fatherid fatherid_1 fatherid_2 fatherid_3 fatherid_4 fatherid_5 result
1: 101 1 NA NA 101 101 101 NA 1
2: 102 1 NA 101 101 101 NA NA 0
3: 103 1 101 101 101 NA NA NA 0
4: 104 1 101 101 NA NA NA NA 0
5: 105 1 101 NA NA NA NA NA 0
6: 201 2 NA NA 201 NA NA NA 1
7: 202 2 NA 201 NA NA NA NA 0
8: 203 2 201 NA NA NA NA NA 0
9: 301 3 NA NA NA NA NA NA 0
10: 302 3 NA NA NA NA NA NA 0
11: 401 4 NA NA NA NA NA NA 0
Upvotes: 2
Reputation: 887108
The OP's dataset is a data.table
object. We can use data.table
methods. Loop over the 'fatherid' columns, check if the 'personid' is equal to the column value, and Reduce
to a single vector
library(data.table)
df[, result := +(Reduce(`|`, lapply(.SD, function(x)
x == personid & !is.na(x)))), .SDcols = patterns('fatherid')]
-output
df
personid hh_id fatherid fatherid_1 fatherid_2 fatherid_3 fatherid_4 fatherid_5 result
1: 101 1 NA NA 101 101 101 NA 1
2: 102 1 NA 101 101 101 NA NA 0
3: 103 1 101 101 101 NA NA NA 0
4: 104 1 101 101 NA NA NA NA 0
5: 105 1 101 NA NA NA NA NA 0
6: 201 2 NA NA 201 NA NA NA 1
7: 202 2 NA 201 NA NA NA NA 0
8: 203 2 201 NA NA NA NA NA 0
9: 301 3 NA NA NA NA NA NA 0
10: 302 3 NA NA NA NA NA NA 0
11: 401 4 NA NA NA NA NA NA 0
Upvotes: 1