David Gil Solsona
David Gil Solsona

Reputation: 45

Find the value of a column over a set of columns in a data frame in R

I'm struggling with the way to find the value of a column across other columns of a data.frame. I'd be thankful if anyone could help me. These are a simplified form of my data:

library(data.table)

df<-data.table(personid<-c(101, 102, 103, 104, 105, 201, 202, 203, 301, 302, 401),
       hh_id<-c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4),
       fatherid<-c(NA, NA, 101, 101, 101, NA, NA, 201, NA, NA, NA),
       fatherid_1<-c(NA,101, 101, 101, NA, NA, 201, NA, NA, NA, NA),
       fatherid_2<-c(101, 101, 101, NA, NA, 201, NA, NA, NA, NA, NA),
       fatherid_3<-c(101, 101, NA, NA, NA, NA, NA, NA, NA, NA, NA),
       fatherid_4<-c(101, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
       fatherid_5<-c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA))

(the real one is 185000 rows and has up to 17 variables such as fatherid_1, fatherid_2... fatherid_17)

What I'm trying to do is to create a variable that checks whether the value of variable personid of a given row is the same as any of the values of variables fatherid_1 to fatherid_5 in the same row. For the given data, the outcome should be:

df$result <- c(1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0) 

but I need something to do it automatically, over 17 columns such as fatherid_1, and a lot of rows

In case you want to understand the sense of my calculations, I'm trying to build household grids without using only information in the same row

Thank you very much in advance!

Upvotes: 2

Views: 1005

Answers (5)

GuedesBF
GuedesBF

Reputation: 9858

Two tidyverse solutions:

1-) You can use dplyr's new if_any(), .== and tidyr's replace_na(). if_any() obviates the need for either rowwise() or reduce()/ Reduce():

library(dplyr)
library(tidyr)

df%>%mutate(result=replace_na(if_any(matches('fatherid'), ~.==personid), 0))

2-) In a rowwise() operation, you can apply a function to check the condition for all the selected columns with map(), c_across() and %in%, which generates a logical vector. This can then be collapsed/reduce()d in the same call.

library(purrr)
library(dplyr)

df%>%rowwise()%>%mutate(result=as.integer(reduce(map(c_across(fatherid_1:fatherid_5), ~. %in% personid), `|`)))

Or with pipes, for clarity:

#option 1
df%>%rowwise()%>%
        mutate(result=map(c_across(fatherid_1:fatherid_5), ~. %in% personid)%>%
                       reduce(`|`)%>%
                       as.integer())
#option 2
df%>%rowwise()%>%
        mutate(result=map_int(c_across(fatherid_1:fatherid_5), ~. %in% personid)%>%
                       reduce(max))

    personid hh_id fatherid fatherid_1 fatherid_2 fatherid_3 fatherid_4 fatherid_5 result
 1:      101     1       NA         NA        101        101        101         NA      1
 2:      102     1       NA        101        101        101         NA         NA      0
 3:      103     1      101        101        101         NA         NA         NA      0
 4:      104     1      101        101         NA         NA         NA         NA      0
 5:      105     1      101         NA         NA         NA         NA         NA      0
 6:      201     2       NA         NA        201         NA         NA         NA      1
 7:      202     2       NA        201         NA         NA         NA         NA      0
 8:      203     2      201         NA         NA         NA         NA         NA      0
 9:      301     3       NA         NA         NA         NA         NA         NA      0
10:      302     3       NA         NA         NA         NA         NA         NA      0
11:      401     4       NA         NA         NA         NA         NA         NA      0

Upvotes: 3

GKi
GKi

Reputation: 39657

A way to solve it in base using == to make the comparison and test if the rowSums >0:

+(rowSums(df[[1]] == df[,3:8], na.rm=TRUE) > 0)
# [1] 1 0 0 0 0 1 0 0 0 0 0

Or using any with apply.

+apply(df[[1]] == df[,3:8], 1, any, na.rm = TRUE)
# [1] 1 0 0 0 0 1 0 0 0 0 0

Or the same but using pipes:

(df[[1]] == df[,3:8]) |> rowSums(na.rm=TRUE) |> (`>`)(0) |> as.integer()

(df[[1]] == df[,3:8]) |> apply(1, any, na.rm=TRUE) |> as.integer()

Upvotes: 2

AnilGoyal
AnilGoyal

Reputation: 26218

If you don't want to use rowwise then this can also be an alternative

library(dplyr)

df %>% group_by(personid) %>%
  mutate(res = sum(cur_group() %in% cur_data()))

# A tibble: 11 x 9
# Groups:   personid [11]
   personid hh_id fatherid fatherid_1 fatherid_2 fatherid_3 fatherid_4 fatherid_5   res
      <dbl> <dbl>    <dbl>      <dbl>      <dbl>      <dbl>      <dbl> <lgl>      <int>
 1      101     1       NA         NA        101        101        101 NA             1
 2      102     1       NA        101        101        101         NA NA             0
 3      103     1      101        101        101         NA         NA NA             0
 4      104     1      101        101         NA         NA         NA NA             0
 5      105     1      101         NA         NA         NA         NA NA             0
 6      201     2       NA         NA        201         NA         NA NA             1
 7      202     2       NA        201         NA         NA         NA NA             0
 8      203     2      201         NA         NA         NA         NA NA             0
 9      301     3       NA         NA         NA         NA         NA NA             0
10      302     3       NA         NA         NA         NA         NA NA             0
11      401     4       NA         NA         NA         NA         NA NA             0

Created on 2021-06-09 by the reprex package (v2.0.0)

If you want to be safe excluding hh_id you could use

df %>% group_by(personid) %>%
  mutate(res = sum(cur_group() %in% cur_data()[-1]))

Upvotes: 2

Anoushiravan R
Anoushiravan R

Reputation: 21908

We can also use the following solution with pmap from purrr package:

library(dplyr)
library(purrr)

df %>%
  mutate(result = pmap_dbl(., ~ {x <- c(...)[-c(1, 2)]; 
  if_else(all(x[!is.na(x)] != c(...)[1]) | all(is.na(x)), 0, 1)}))


    personid hh_id fatherid fatherid_1 fatherid_2 fatherid_3 fatherid_4 fatherid_5 result
 1:      101     1       NA         NA        101        101        101         NA      1
 2:      102     1       NA        101        101        101         NA         NA      0
 3:      103     1      101        101        101         NA         NA         NA      0
 4:      104     1      101        101         NA         NA         NA         NA      0
 5:      105     1      101         NA         NA         NA         NA         NA      0
 6:      201     2       NA         NA        201         NA         NA         NA      1
 7:      202     2       NA        201         NA         NA         NA         NA      0
 8:      203     2      201         NA         NA         NA         NA         NA      0
 9:      301     3       NA         NA         NA         NA         NA         NA      0
10:      302     3       NA         NA         NA         NA         NA         NA      0
11:      401     4       NA         NA         NA         NA         NA         NA      0

Upvotes: 2

akrun
akrun

Reputation: 887108

The OP's dataset is a data.table object. We can use data.table methods. Loop over the 'fatherid' columns, check if the 'personid' is equal to the column value, and Reduce to a single vector

library(data.table)
df[, result  := +(Reduce(`|`, lapply(.SD, function(x) 
      x == personid & !is.na(x)))), .SDcols = patterns('fatherid')]

-output

df
    personid hh_id fatherid fatherid_1 fatherid_2 fatherid_3 fatherid_4 fatherid_5 result
 1:      101     1       NA         NA        101        101        101         NA      1
 2:      102     1       NA        101        101        101         NA         NA      0
 3:      103     1      101        101        101         NA         NA         NA      0
 4:      104     1      101        101         NA         NA         NA         NA      0
 5:      105     1      101         NA         NA         NA         NA         NA      0
 6:      201     2       NA         NA        201         NA         NA         NA      1
 7:      202     2       NA        201         NA         NA         NA         NA      0
 8:      203     2      201         NA         NA         NA         NA         NA      0
 9:      301     3       NA         NA         NA         NA         NA         NA      0
10:      302     3       NA         NA         NA         NA         NA         NA      0
11:      401     4       NA         NA         NA         NA         NA         NA      0

Upvotes: 1

Related Questions