Reputation: 117

conditional match and return values in r

I have two tables named A and B. There are 300k rows in B. Just one row in a. I want to generate a table C based on if there is a value in B match value in A for each row. If it is, return 1; If is not, return 0; Finally, get a matrix C, which includes (0, 1) with the same row as B. I use the Match function in excel, but my data is too large. Can realize it in R?

A01B A01C A01D A01E A01F A01G

id1 a A01C  NA    NA    NA 
id2 b A01C A01D   NA    NA
id3 c B01C B03D   NA    NA
id4 d A01F A01F  A01F   NA
...

A01B A01C A01D A01E A01F A01G
 0    1     0    0    0    0
 0    1     1    0    0    0
 0    0     0    0    0    0
 0    0     0    0    1    0

Upvotes: 0

Answers (3)

GKi

Reputation: 39707

You can use %in% with apply:

C <- +t(apply(B, 1, "%in%", x=A))
colnames(C) <- A
C
#  A01B A01C A01D A01E A01F A01G
#a    0    1    0    0    0    0
#b    0    1    1    0    0    0
#c    0    0    0    0    0    0
#d    0    0    0    0    1    0

Data:

A <- c("A01B", "A01C", "A01D", "A01E", "A01F", "A01G")
B <- read.table(row.names=2, text="
id1 a A01C  NA    NA    NA 
id2 b A01C A01D   NA    NA
id3 c B01C B03D   NA    NA
id4 d A01F A01F  A01F   NA")[-1]

Upvotes: 0

mathematical.coffee

Reputation: 56935

Many ways to do this; here is one I can think of. (There is probably something super slick and efficient but I think for 300k rows this will be OK).

First convert your code into a reproducible example.

Here A is a vector in R (read yours in as necessary and coerce to vector)

A <- c("A01B", "A01C", "A01D", "A01E", "A01F", "A01G")

I'm using the data.table package here because I like its syntax. You will need to make your B a data.table not just a data.frame

library(data.table)

# I used dput(B) to get this command to create a reproducible example
B <- data.table(structure(list(col1 = c("id1", "id2", "id3", "id4"), col2 = c("a", 
"b", "c", "d"), col3 = c("A01C", "A01C", "B01C", "A01F"), col4 = c(NA, 
"A01D", "B03D", "A01F"), col5 = c(NA, NA, NA, "A01F"), col6 = c(NA_character_, 
NA_character_, NA_character_, NA_character_)), class = "data.frame", row.names = c(NA, 
-4L)))

#      col1   col2   col3   col4   col5   col6
#    <char> <char> <char> <char> <char> <char>
# 1:    id1      a   A01C   <NA>   <NA>   <NA>
# 2:    id2      b   A01C   A01D   <NA>   <NA>
# 3:    id3      c   B01C   B03D   <NA>   <NA>
# 4:    id4      d   A01F   A01F   A01F   <NA>

Now to your problem. Answer then explanation. Answer:

> col_names <- tail(names(B), -2)
> B[,
     sapply(
         A,
         function (code) { pmin(1, rowSums(.SD == code, na.rm=T)) },
         simplify=F, USE.NAMES=T
      ),
      .SDcols=col_names
    ]
    A01B  A01C  A01D  A01E  A01F  A01G
   <num> <num> <num> <num> <num> <num>
1:     0     1     0     0     0     0
2:     0     1     1     0     0     0
3:     0     0     0     0     0     0
4:     0     0     0     0     1     0

Edit: just realised it's way easier to read if you ditch the data frame and just use a matrix of all but your first 2 columns of B! Your result will also be a matrix rather than a data frame.

# B[, ..col_names] if using a data.table
# B[, col_names] if using a data.frame
sapply(A, function (code) { pmin(1, rowSums(B[, ..col_names] == code, na.rm=T)) })
     A01B A01C A01D A01E A01F A01G
[1,]    0    1    0    0    0    0
[2,]    0    1    1    0    0    0
[3,]    0    0    0    0    0    0
[4,]    0    0    0    0    1    0

Explanation: First presume I only have one code 'A01C' and am just trying to produce the A01C column.

First make a vector of column names we want to check (everything except the first 2)

col_names <- tail(names(B), -2)

Then check if any of these columns is A01C (the .SDcols=col_names just selects columns 3 to 6)

# this is TRUE if the column has A01C in it. 
> B[, .SD == 'A01C', .SDcols=col_names]
      col3  col4  col5 col6
[1,]  TRUE    NA    NA   NA
[2,]  TRUE FALSE    NA   NA
[3,] FALSE FALSE    NA   NA
[4,] FALSE FALSE FALSE   NA

But we want to combine these to one value per row. We can do this by adding the TRUEs in each row, which returns the number of matches. rowSums will do this. I add na.rm=T to treat the NA as 0. The .(A01C=rowSums(...)) syntax just says "make the output a column called A01C".

# But we wnat to condense this to one value per row.
> B[, .(A01C=rowSums(.SD == 'A01C', na.rm=T) > 0), .SDcols=col_names]
    A01C
   <num>
1:     1
2:     1
3:     0
4:     0

Great, so now we just have to loop over every code in A and do this for each.

> B[,
     sapply(
         A,
         function (code) { rowSums(.SD == code, na.rm=T) },
         simplify=F, USE.NAMES=T
      ),
      .SDcols=col_names
    ]
    A01B  A01C  A01D  A01E  A01F  A01G
   <num> <num> <num> <num> <num> <num>
1:     0     1     0     0     0     0
2:     0     1     1     0     0     0
3:     0     0     0     0     0     0
4:     0     0     0     0     3     0

Except note that this returns the number of matches (e.g. the A01F column row 4 has '3' rather than '1' because there are 3 A01Fs in that row). You seem to want just a 1 or 0, so we can just take the minimum of each number and 1 (or we could do a > 0 check and convert to numeric, doesn't matter). To do this we change rowSums(...) to pmin(1, rowSums(...)) and get the desired result already posted above.

Upvotes: 1

Ronak Shah

Reputation: 389135

You can combine the column values in B into one column using tidyr::unite and then expand them into 1/0 values using cSplit_e from splitstackshape

result <- B %>%
 tidyr::unite(tmp, V3:V4, na.rm = TRUE) %>%
 splitstackshape::cSplit_e('tmp', sep = '_', type = 'character', fill = 0)

result

#   V1 V2       tmp tmp_A01C tmp_A01D tmp_A01F tmp_B01C tmp_B03D
#1 id1  a      A01C        1        0        0        0        0
#2 id2  b A01C_A01D        1        1        0        0        0
#3 id3  c B01C_B03D        0        0        0        1        1
#4 id4  d A01F_A01F        0        0        1        0        0

If there are certain values in A which are not present in B at all we can use setdiff to create those columns in result.

result[setdiff(unlist(A), names(result))] <- 0

Upvotes: 0

conditional match and return values in r

Answers (3)

Related Questions