Reputation: 117
I have two tables named A and B. There are 300k rows in B. Just one row in a. I want to generate a table C based on if there is a value in B match value in A for each row. If it is, return 1; If is not, return 0; Finally, get a matrix C, which includes (0, 1) with the same row as B. I use the Match function in excel, but my data is too large. Can realize it in R?
A:
A01B A01C A01D A01E A01F A01G
B:
id1 a A01C NA NA NA
id2 b A01C A01D NA NA
id3 c B01C B03D NA NA
id4 d A01F A01F A01F NA
...
C:
A01B A01C A01D A01E A01F A01G
0 1 0 0 0 0
0 1 1 0 0 0
0 0 0 0 0 0
0 0 0 0 1 0
Upvotes: 0
Views: 593
Reputation: 39707
You can use %in%
with apply
:
C <- +t(apply(B, 1, "%in%", x=A))
colnames(C) <- A
C
# A01B A01C A01D A01E A01F A01G
#a 0 1 0 0 0 0
#b 0 1 1 0 0 0
#c 0 0 0 0 0 0
#d 0 0 0 0 1 0
Data:
A <- c("A01B", "A01C", "A01D", "A01E", "A01F", "A01G")
B <- read.table(row.names=2, text="
id1 a A01C NA NA NA
id2 b A01C A01D NA NA
id3 c B01C B03D NA NA
id4 d A01F A01F A01F NA")[-1]
Upvotes: 0
Reputation: 56935
Many ways to do this; here is one I can think of. (There is probably something super slick and efficient but I think for 300k rows this will be OK).
First convert your code into a reproducible example.
Here A
is a vector in R (read yours in as necessary and coerce to vector)
A <- c("A01B", "A01C", "A01D", "A01E", "A01F", "A01G")
I'm using the data.table
package here because I like its syntax. You will need to make your B a data.table not just a data.frame
library(data.table)
# I used dput(B) to get this command to create a reproducible example
B <- data.table(structure(list(col1 = c("id1", "id2", "id3", "id4"), col2 = c("a",
"b", "c", "d"), col3 = c("A01C", "A01C", "B01C", "A01F"), col4 = c(NA,
"A01D", "B03D", "A01F"), col5 = c(NA, NA, NA, "A01F"), col6 = c(NA_character_,
NA_character_, NA_character_, NA_character_)), class = "data.frame", row.names = c(NA,
-4L)))
# col1 col2 col3 col4 col5 col6
# <char> <char> <char> <char> <char> <char>
# 1: id1 a A01C <NA> <NA> <NA>
# 2: id2 b A01C A01D <NA> <NA>
# 3: id3 c B01C B03D <NA> <NA>
# 4: id4 d A01F A01F A01F <NA>
Now to your problem. Answer then explanation. Answer:
> col_names <- tail(names(B), -2)
> B[,
sapply(
A,
function (code) { pmin(1, rowSums(.SD == code, na.rm=T)) },
simplify=F, USE.NAMES=T
),
.SDcols=col_names
]
A01B A01C A01D A01E A01F A01G
<num> <num> <num> <num> <num> <num>
1: 0 1 0 0 0 0
2: 0 1 1 0 0 0
3: 0 0 0 0 0 0
4: 0 0 0 0 1 0
Edit: just realised it's way easier to read if you ditch the data frame and just use a matrix of all but your first 2 columns of B! Your result will also be a matrix rather than a data frame.
# B[, ..col_names] if using a data.table
# B[, col_names] if using a data.frame
sapply(A, function (code) { pmin(1, rowSums(B[, ..col_names] == code, na.rm=T)) })
A01B A01C A01D A01E A01F A01G
[1,] 0 1 0 0 0 0
[2,] 0 1 1 0 0 0
[3,] 0 0 0 0 0 0
[4,] 0 0 0 0 1 0
Explanation: First presume I only have one code 'A01C' and am just trying to produce the A01C column.
First make a vector of column names we want to check (everything except the first 2)
col_names <- tail(names(B), -2)
Then check if any of these columns is A01C (the .SDcols=col_names
just selects columns 3 to 6)
# this is TRUE if the column has A01C in it.
> B[, .SD == 'A01C', .SDcols=col_names]
col3 col4 col5 col6
[1,] TRUE NA NA NA
[2,] TRUE FALSE NA NA
[3,] FALSE FALSE NA NA
[4,] FALSE FALSE FALSE NA
But we want to combine these to one value per row. We can do this by adding the TRUEs in each row, which returns the number of matches. rowSums
will do this. I add na.rm=T
to treat the NA as 0. The .(A01C=rowSums(...))
syntax just says "make the output a column called A01C".
# But we wnat to condense this to one value per row.
> B[, .(A01C=rowSums(.SD == 'A01C', na.rm=T) > 0), .SDcols=col_names]
A01C
<num>
1: 1
2: 1
3: 0
4: 0
Great, so now we just have to loop over every code in A
and do this for each.
> B[,
sapply(
A,
function (code) { rowSums(.SD == code, na.rm=T) },
simplify=F, USE.NAMES=T
),
.SDcols=col_names
]
A01B A01C A01D A01E A01F A01G
<num> <num> <num> <num> <num> <num>
1: 0 1 0 0 0 0
2: 0 1 1 0 0 0
3: 0 0 0 0 0 0
4: 0 0 0 0 3 0
Except note that this returns the number of matches (e.g. the A01F column row 4 has '3' rather than '1' because there are 3 A01Fs in that row). You seem to want just a 1 or 0, so we can just take the minimum of each number and 1 (or we could do a > 0 check and convert to numeric, doesn't matter). To do this we change rowSums(...)
to pmin(1, rowSums(...))
and get the desired result already posted above.
Upvotes: 1
Reputation: 389135
You can combine the column values in B
into one column using tidyr::unite
and then expand them into 1/0 values using cSplit_e
from splitstackshape
result <- B %>%
tidyr::unite(tmp, V3:V4, na.rm = TRUE) %>%
splitstackshape::cSplit_e('tmp', sep = '_', type = 'character', fill = 0)
result
# V1 V2 tmp tmp_A01C tmp_A01D tmp_A01F tmp_B01C tmp_B03D
#1 id1 a A01C 1 0 0 0 0
#2 id2 b A01C_A01D 1 1 0 0 0
#3 id3 c B01C_B03D 0 0 0 1 1
#4 id4 d A01F_A01F 0 0 1 0 0
If there are certain values in A
which are not present in B
at all we can use setdiff
to create those columns in result
.
result[setdiff(unlist(A), names(result))] <- 0
Upvotes: 0