Reputation: 559

a speedy way to compare two rows

I have a large data frame with 2 rows and 30406 columns. I need to count the number of columns in which a 0 is present in both rows (match) and the number of columns in which a 0 is present in one row but not in the other (no match).

I think that looping through and comparing each column one by one will take too long, given that there are more than 30k columns.

head(to_compare)[1:5]
         bin:82154:182154 bin:82154:282154 bin:82154:382154 bin:82154:482154
1-D1.txt                0                1                2                0
1-D2.txt                1                1                1                1
         bin:82154:582154
1-D1.txt                0
1-D2.txt                0

output

match
1

no_match
1

Upvotes: 0

Answers (4)

GKi

Reputation: 39727

set.seed(7)
n <- 30406
to_compare <- data.frame(matrix(floor(runif(n*2, 0, 3)), nrow = 2))

table(colSums(to_compare==0))
#    0     1     2 
#13519 13513  3374 
#
# 0: neither row has a zero in the column (13519)
# 1: exactly one row has a zero in the column, i.e. no match (13513)
# 2: both rows have a zero in the column, i.e. match (3374)

system.time(table(colSums(to_compare==0)))
#   user  system elapsed
#  0.332   0.000   0.330
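
The table can be turned into the two numbers the question asks for. One thing to watch: table() only lists values that actually occur, so looking up "2" fails if no column has two zeros. A minimal sketch using tabulate() instead, which always returns a fixed number of bins. The toy data and the names zero_counts and counts are illustrative assumptions, not part of the original answer:

```r
# Hypothetical toy data: 2 rows, 6 columns
to_compare <- data.frame(matrix(c(0, 0,
                                  0, 1,
                                  2, 0,
                                  1, 1,
                                  0, 2,
                                  1, 2), nrow = 2))

# Number of zeros in each column (0, 1 or 2)
zero_counts <- colSums(to_compare == 0)

# tabulate() counts the values 1..nbins, so shift by one to cover 0
counts <- tabulate(zero_counts + 1, nbins = 3)
names(counts) <- c("none", "no_match", "match")

counts[["match"]]     # columns where both rows are 0
counts[["no_match"]]  # columns where exactly one row is 0
```

Because nbins is fixed at 3, a category that never occurs shows up as 0 instead of being missing.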

Upvotes: 1

Sven

Reputation: 1263

A different and very simple approach would be to first switch columns to rows and then just use rowSums:

#Create sample df
df <- data.frame(col1 = c(0,1), col2 = c(1,0), col3 = c(1,1), col4 = c(0,2), col5 = c(3,0), col6 = c(0,0))

#Convert columns to rows
df_long <- t(df)

#Count number of 0s in every row and show in table of 0, 1 or 2 zeros
table(rowSums(df_long == 0))

0 1 2 
1 4 1

Upvotes: 1

Andrew

Reputation: 5138

You could use colSums for a vectorized solution:

set.seed(123)
df <- as.data.frame(matrix(round(runif(50, 0, 2)), nrow = 2))

# Match 
sum(colSums(df==0) == 2)
[1] 2

# No match
sum(colSums(df==0) == 1)
[1] 8
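
With only two rows, the same counts can also be computed by comparing the rows directly as logical vectors. A rough sketch, assuming the data is held as (or converted with as.matrix() to) a numeric matrix; m, z1, z2, n_match and n_no_match are made-up names for illustration:

```r
set.seed(1)
# Hypothetical data with the question's dimensions: 2 rows, 30406 columns
m <- matrix(sample(0:2, 2 * 30406, replace = TRUE), nrow = 2)

# Logical vectors marking where each row is 0
z1 <- m[1, ] == 0
z2 <- m[2, ] == 0

# Match: both rows are 0 in the column
n_match <- sum(z1 & z2)

# No match: exactly one of the two rows is 0 in the column
n_no_match <- sum(xor(z1, z2))
```

This should agree with the colSums version; it just makes the "both" and "exactly one" conditions explicit.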

Upvotes: 3

Andryas Waurzenczak

Reputation: 469

set.seed(123)
df <- as.data.frame(matrix(round(runif(10, 0, 2)), nrow = 2))

# Match: count the columns where both rows are 0
sum(apply(df, 2, function(x) all(x == 0)))

# No match: count the columns where exactly one of the two rows is 0
sum(apply(df, 2, function(x) any(x == 0) && (x[1] != x[2])))

Upvotes: 1
