Reputation: 1619
I have 2 data frames:
at1 = data.frame(ID = c("A", "B", "C", "D", "E"), Sample1 = rnorm(5, 50000, 2500),
Sample2 = rnorm(5, 50000, 2500), Sample3 = rnorm(5, 50000, 2500),
row.names = "ID")
Sample1 Sample2 Sample3
A 52626.55 51924.51 50919.90
B 51430.51 49100.38 51005.92
C 50038.27 52254.73 50014.78
D 48644.46 53926.53 51590.05
E 46462.01 45097.48 50963.39
bt1 = data.frame(ID = c("A", "B", "C", "D", "E"), Sample1 = c(0,1,1,1,1),
Sample2 = c(0,0,0,1,0), Sample3 = c(1,0,1,1,0),
row.names = "ID")
Sample1 Sample2 Sample3
A 0 0 1
B 1 0 0
C 1 0 1
D 1 1 1
E 1 0 0
I would like to filter every cell in at1 based on the value in the corresponding cell in bt1 (0 or 1) and have the result stored in a new data frame ct1. For instance, if bt1[1, "Sample1"] = 1 then ct1[1, "Sample1"] = at1[1, "Sample1"]. If bt1[1, "Sample1"] = 0 then ct1[1, "Sample1"] = 0. My original data frames have more than 100 columns and more than 30,000 rows.
I was wondering if there is an easier way than writing if-loops (e.g. using "apply"?).
Upvotes: 9
Views: 1836
Reputation: 7475
A cheeky approach using sqldf
library(sqldf)
variables <- "bt1.Sample1*at1.Sample1 Sample1,
bt1.Sample2*at1.Sample2 Sample2,
bt1.Sample3*at1.Sample3 Sample3"
fn$sqldf("SELECT $variables from at1,bt1 WHERE at1.ROWID=bt1.ROWID")
# Sample1 Sample2 Sample3
#1 0.00 0.00 55778.34
#2 48819.24 0.00 0.00
#3 51896.14 0.00 52522.69
#4 47946.93 48604.23 47755.30
#5 49423.68 0.00 0.00
Upvotes: 5
Reputation: 115382
Here is a data.table
solution, and a second simplistic solution
note that I have made ID
a specific column in the data.frame
not the row.names
for ideological and pratical reasons
data.table
does not have rownameslibrary(data.table)
library(reshape2)
bt1 <- data.frame(ID = c("A", "B", "C", "D", "E"), Sample1 = c(0,1,1,1,1),
Sample2 = c(0,0,0,1,0), Sample3 = c(1,0,1,1,0))
at1 <- data.frame(ID = c("A", "B", "C", "D", "E"), Sample1 = rnorm(5, 50000, 2500),
Sample2 = rnorm(5, 50000, 2500), Sample3 = rnorm(5, 50000, 2500))
# place in long form
at_long <- data.table(melt(at1, id.var = 1))
bt_long <- data.table(melt(bt1, value.name = 'bt_value', id.var = 1))
# set keys for easy merging with data.tabl
setkeyv(at_long, c('ID','variable'))
setkeyv(bt_long, c('ID','variable'))
# merge
combined <- at_long[bt_long]
# set those where 'bt_value == 0' as 0
set(combined, which(combined[['bt_value']]==0), 'value',0)
# or (using the fact that the `bt` data is only 0 or 1
combined[value := value * bt_value]
# then reshape to wide format
dcast(combined, ID~variable, value.var = 'value')
## ID Sample1 Sample2 Sample3
## 1 A 0.00 0.00 50115.24
## 2 B 50173.16 0.00 0.00
## 3 C 48216.31 0.00 51952.30
## 4 D 52387.53 50889.95 44043.66
## 5 E 50982.56 0.00 0.00
If you know that the row orders are the same in bt1
and at1
(your data sets), you can simply multiply the appropriate components of the data.frames (*
works element-wise)
sample_cols <- paste0('Sample',1:3)
at1[,sample_cols] * bt1[,sample_cols]
## Sample1 Sample2 Sample3
## 1 0.00 0.00 50115.24
## 2 50173.16 0.00 0.00
## 3 48216.31 0.00 51952.30
## 4 52387.53 50889.95 44043.66
## 5 50982.56 0.00 0.00
which you could cbind
to the ID
from at1
or bt1
or if kept the ID
as row.names
, then the row.names would persist.
Upvotes: 7
Reputation: 56905
You can use vectorisation (amongst other things).
For example:
ct1 <- at1 # set ct1 equal to at1
ct1$Sample1[bt1$Sample1 == 0] <- 0 # if bt1$Sample1 = 0, set the value to 0
For the second line: bt1$Sample1 == 0
is a logical vector that is TRUE
if bt1$Sample1
is 0, and then we use it as an index into ct1
in order to set those values to 0. Since ct1
is initialised to at1
, all the other rows (where bt1$Sample1 == 1
) are set to the value in at1
.
Another way to do this could be using ifelse
, which is a vectorised form of an if statement:
ct1$Sample1 <- ifelse(bt1$Sample1 == 0, 0, at1$Sample1)
It means "for each row in bt1$Sample1
, if bt1$Sample1[row] == 0
substitute a 0, and otherwise substitute at1$Sample1[row]
.
You can repeat this for each column you are interested in.
You could loop through columns, or you could use something like vapply
to say:
for each column `col` in bt1:
ct1$col <- ifelse(bt1$col == 0, 0, at1$col)
This can be achieved by:
ct1 <- vapply(colnames(bt1), function (col) {
ifelse(bt1[[col]] == 0, 0, at1[[col]])
}, FUN.VALUE=at1$Sample1)
See ?vapply
, but in brief:
colnames(bt1)
means "for each column in bt
",function (col) { ifelse(bt1[[col]] == 0, 0, at1[[col]]) }
is the statement in the pseudocode above: set the value eqqual to 0 if bt1 is 0 and set it to the value in at1
otherwise,FUN.VALUE=at1$Sample1
is because vapply
requires an example of what the function will output (in our case, a column of a data frame).Upvotes: 5