Letin
Letin

Reputation: 1275

How to remove rows in a dataframe based on values of another dataframe

I am trying to filter out rows in my data frame (MainData), based on a criteria of p-values in another dataframe (PvalueData). So, what I want is: if more than 50% of the colums in a row has a p-value >0.05 (PvalueData), then that particular row will be removed from the main dataframe (MainDatA).

Lets say, here is the data I have:

MainData:

        C1   C2   C3   C4   C5
Gene1   70   54   54   75   75
Gene2   23   18   16   54   15
Gene3   43   93   90   43   92
Gene4   32   50   23   13   45
Gene5   44   53   46   34   47
Gene6   42   34   53   85   43
Gene7   49   55   67   49   89
Gene8   25   45   49   34   35
Gene9   19   16   54   53   94

PvalueData:

        C1     C2     C3     C4     C5
Gene1   0.04   0.01   0.01   0.01   0.01
Gene2   0.01   0.01   0.01   0.02   0.01
Gene3   0.01   0.07   0.09   0.01   0.06
Gene4   0.01   0.03   0.06   0.01   0.02
Gene5   0.04   0.01   0.07   0.08   0.01
Gene6   0.09   0.07   0.01   0.06   0.06
Gene7   0.10   0.07   0.01   0.01   0.06
Gene8   0.01   0.01   0.02   0.01   0.01
Gene9   0.09   0.01   0.07   0.08   0.06

So my result file should look like:

Result:

        C1   C2   C3   C4   C5
Gene1   70   54   54   75   75
Gene2   23   18   16   54   15
Gene4   32   50   23   13   45
Gene5   44   53   46   34   47
Gene8   25   45   49   34   35

I did try something like this:

check = if (PvalueData[!rowSums(PvalueData>thres) > ncol(PvalueData)*.5], MainData)

But can't really work it out.

Upvotes: 2

Views: 588

Answers (3)

989
989

Reputation: 12935

This is what you could do using rowSums and subset:

subset(df, rowSums(df2>.05)<ceiling(ncol(df)/2))

      # C1 C2 C3 C4 C5
# Gene1 70 54 54 75 75
# Gene2 23 18 16 54 15
# Gene4 32 50 23 13 45
# Gene5 44 53 46 34 47
# Gene8 25 45 49 34 35

Keep those rows that less than 50% of the p-values are greater than 0.05.

DATA

df <- structure(list(C1 = c(70L, 23L, 43L, 32L, 44L, 42L, 49L, 25L, 
19L), C2 = c(54L, 18L, 93L, 50L, 53L, 34L, 55L, 45L, 16L), C3 = c(54L, 
16L, 90L, 23L, 46L, 53L, 67L, 49L, 54L), C4 = c(75L, 54L, 43L, 
13L, 34L, 85L, 49L, 34L, 53L), C5 = c(75L, 15L, 92L, 45L, 47L, 
43L, 89L, 35L, 94L)), .Names = c("C1", "C2", "C3", "C4", "C5"
), class = "data.frame", row.names = c("Gene1", "Gene2", "Gene3", 
"Gene4", "Gene5", "Gene6", "Gene7", "Gene8", "Gene9"))

df2 <- structure(list(C1 = c(0.04, 0.01, 0.01, 0.01, 0.04, 0.09, 0.1, 
0.01, 0.09), C2 = c(0.01, 0.01, 0.07, 0.03, 0.01, 0.07, 0.07, 
0.01, 0.01), C3 = c(0.01, 0.01, 0.09, 0.06, 0.07, 0.01, 0.01, 
0.02, 0.07), C4 = c(0.01, 0.02, 0.01, 0.01, 0.08, 0.06, 0.01, 
0.01, 0.08), C5 = c(0.01, 0.01, 0.06, 0.02, 0.01, 0.06, 0.06, 
0.01, 0.06)), .Names = c("C1", "C2", "C3", "C4", "C5"), class = "data.frame", row.names = c("Gene1", 
"Gene2", "Gene3", "Gene4", "Gene5", "Gene6", "Gene7", "Gene8", 
"Gene9"))

Upvotes: 2

lmo
lmo

Reputation: 38520

Here is an answer using rowMeans:

df[rowMeans(df2 < 0.05) > 0.5,]
      C1 C2 C3 C4 C5
Gene1 70 54 54 75 75
Gene2 23 18 16 54 15
Gene4 32 50 23 13 45
Gene5 44 53 46 34 47
Gene8 25 45 49 34 35

Here is a quick breakdown of the code:

  • df2 < 0.05 coerces the data.frame into a matrix made up of logical elements (TRUE and FALSE) where an element is TRUE if it fits your p-value criterion.
  • rowMeans calculates the mean of these logical values for each row.
  • These means are used to subset the main data.frame according to the second criterion.

data

df <- read.table(header=T, text="C1   C2   C3   C4   C5
Gene1   70   54   54   75   75
Gene2   23   18   16   54   15
Gene3   43   93   90   43   92
Gene4   32   50   23   13   45
Gene5   44   53   46   34   47
Gene6   42   34   53   85   43
Gene7   49   55   67   49   89
Gene8   25   45   49   34   35
Gene9   19   16   54   53   94")

df2 <- read.table(header=T, text="C1     C2     C3     C4     C5
Gene1   0.04   0.01   0.01   0.01   0.01
Gene2   0.01   0.01   0.01   0.02   0.01
Gene3   0.01   0.07   0.09   0.01   0.06
Gene4   0.01   0.03   0.06   0.01   0.02
Gene5   0.04   0.01   0.07   0.08   0.01
Gene6   0.09   0.07   0.01   0.06   0.06
Gene7   0.10   0.07   0.01   0.01   0.06
Gene8   0.01   0.01   0.02   0.01   0.01
Gene9   0.09   0.01   0.07   0.08   0.06")

Upvotes: 5

LyzandeR
LyzandeR

Reputation: 37889

Maybe not the most efficient way, but in base R you could use a simple apply:

df1[apply(df2, 1, function(x) sum(x <= 0.05) >= 3), ]
#      C1 C2 C3 C4 C5
#Gene1 70 54 54 75 75
#Gene2 23 18 16 54 15
#Gene4 32 50 23 13 45
#Gene5 44 53 46 34 47
#Gene8 25 45 49 34 35

Essentially apply will iterate through rows and if the sum of x <= 0.05 is greater (or equal) than 3 then will return TRUE otherwise FALSE. We then use this logical vector to subset df1

Data

df1 <- read.table(text = "        C1   C2   C3   C4   C5
Gene1   70   54   54   75   75
Gene2   23   18   16   54   15
Gene3   43   93   90   43   92
Gene4   32   50   23   13   45
Gene5   44   53   46   34   47
Gene6   42   34   53   85   43
Gene7   49   55   67   49   89
Gene8   25   45   49   34   35
Gene9   19   16   54   53   94")

df2 <- read.table(text = "        C1     C2     C3     C4     C5
Gene1   0.04   0.01   0.01   0.01   0.01
Gene2   0.01   0.01   0.01   0.02   0.01
Gene3   0.01   0.07   0.09   0.01   0.06
Gene4   0.01   0.03   0.06   0.01   0.02
Gene5   0.04   0.01   0.07   0.08   0.01
Gene6   0.09   0.07   0.01   0.06   0.06
Gene7   0.10   0.07   0.01   0.01   0.06
Gene8   0.01   0.01   0.02   0.01   0.01
Gene9   0.09   0.01   0.07   0.08   0.06")

Upvotes: 3

Related Questions