Reputation: 1275
I am trying to filter out rows in my data frame (MainData), based on a criteria of p-values in another dataframe (PvalueData). So, what I want is: if more than 50% of the colums in a row has a p-value >0.05 (PvalueData), then that particular row will be removed from the main dataframe (MainDatA).
Lets say, here is the data I have:
MainData:
C1 C2 C3 C4 C5
Gene1 70 54 54 75 75
Gene2 23 18 16 54 15
Gene3 43 93 90 43 92
Gene4 32 50 23 13 45
Gene5 44 53 46 34 47
Gene6 42 34 53 85 43
Gene7 49 55 67 49 89
Gene8 25 45 49 34 35
Gene9 19 16 54 53 94
PvalueData:
C1 C2 C3 C4 C5
Gene1 0.04 0.01 0.01 0.01 0.01
Gene2 0.01 0.01 0.01 0.02 0.01
Gene3 0.01 0.07 0.09 0.01 0.06
Gene4 0.01 0.03 0.06 0.01 0.02
Gene5 0.04 0.01 0.07 0.08 0.01
Gene6 0.09 0.07 0.01 0.06 0.06
Gene7 0.10 0.07 0.01 0.01 0.06
Gene8 0.01 0.01 0.02 0.01 0.01
Gene9 0.09 0.01 0.07 0.08 0.06
So my result file should look like:
Result:
C1 C2 C3 C4 C5
Gene1 70 54 54 75 75
Gene2 23 18 16 54 15
Gene4 32 50 23 13 45
Gene5 44 53 46 34 47
Gene8 25 45 49 34 35
I did try something like this:
check = if (PvalueData[!rowSums(PvalueData>thres) > ncol(PvalueData)*.5], MainData)
But can't really work it out.
Upvotes: 2
Views: 588
Reputation: 12935
This is what you could do using rowSums
and subset
:
subset(df, rowSums(df2>.05)<ceiling(ncol(df)/2))
# C1 C2 C3 C4 C5
# Gene1 70 54 54 75 75
# Gene2 23 18 16 54 15
# Gene4 32 50 23 13 45
# Gene5 44 53 46 34 47
# Gene8 25 45 49 34 35
Keep those rows that less than 50%
of the p-values are greater than 0.05
.
DATA
df <- structure(list(C1 = c(70L, 23L, 43L, 32L, 44L, 42L, 49L, 25L,
19L), C2 = c(54L, 18L, 93L, 50L, 53L, 34L, 55L, 45L, 16L), C3 = c(54L,
16L, 90L, 23L, 46L, 53L, 67L, 49L, 54L), C4 = c(75L, 54L, 43L,
13L, 34L, 85L, 49L, 34L, 53L), C5 = c(75L, 15L, 92L, 45L, 47L,
43L, 89L, 35L, 94L)), .Names = c("C1", "C2", "C3", "C4", "C5"
), class = "data.frame", row.names = c("Gene1", "Gene2", "Gene3",
"Gene4", "Gene5", "Gene6", "Gene7", "Gene8", "Gene9"))
df2 <- structure(list(C1 = c(0.04, 0.01, 0.01, 0.01, 0.04, 0.09, 0.1,
0.01, 0.09), C2 = c(0.01, 0.01, 0.07, 0.03, 0.01, 0.07, 0.07,
0.01, 0.01), C3 = c(0.01, 0.01, 0.09, 0.06, 0.07, 0.01, 0.01,
0.02, 0.07), C4 = c(0.01, 0.02, 0.01, 0.01, 0.08, 0.06, 0.01,
0.01, 0.08), C5 = c(0.01, 0.01, 0.06, 0.02, 0.01, 0.06, 0.06,
0.01, 0.06)), .Names = c("C1", "C2", "C3", "C4", "C5"), class = "data.frame", row.names = c("Gene1",
"Gene2", "Gene3", "Gene4", "Gene5", "Gene6", "Gene7", "Gene8",
"Gene9"))
Upvotes: 2
Reputation: 38520
Here is an answer using rowMeans:
df[rowMeans(df2 < 0.05) > 0.5,]
C1 C2 C3 C4 C5
Gene1 70 54 54 75 75
Gene2 23 18 16 54 15
Gene4 32 50 23 13 45
Gene5 44 53 46 34 47
Gene8 25 45 49 34 35
Here is a quick breakdown of the code:
df2 < 0.05
coerces the data.frame into a matrix made up of logical elements (TRUE and FALSE) where an element is TRUE if it fits your p-value criterion.rowMeans
calculates the mean of these logical values for each row.data
df <- read.table(header=T, text="C1 C2 C3 C4 C5
Gene1 70 54 54 75 75
Gene2 23 18 16 54 15
Gene3 43 93 90 43 92
Gene4 32 50 23 13 45
Gene5 44 53 46 34 47
Gene6 42 34 53 85 43
Gene7 49 55 67 49 89
Gene8 25 45 49 34 35
Gene9 19 16 54 53 94")
df2 <- read.table(header=T, text="C1 C2 C3 C4 C5
Gene1 0.04 0.01 0.01 0.01 0.01
Gene2 0.01 0.01 0.01 0.02 0.01
Gene3 0.01 0.07 0.09 0.01 0.06
Gene4 0.01 0.03 0.06 0.01 0.02
Gene5 0.04 0.01 0.07 0.08 0.01
Gene6 0.09 0.07 0.01 0.06 0.06
Gene7 0.10 0.07 0.01 0.01 0.06
Gene8 0.01 0.01 0.02 0.01 0.01
Gene9 0.09 0.01 0.07 0.08 0.06")
Upvotes: 5
Reputation: 37889
Maybe not the most efficient way, but in base R you could use a simple apply
:
df1[apply(df2, 1, function(x) sum(x <= 0.05) >= 3), ]
# C1 C2 C3 C4 C5
#Gene1 70 54 54 75 75
#Gene2 23 18 16 54 15
#Gene4 32 50 23 13 45
#Gene5 44 53 46 34 47
#Gene8 25 45 49 34 35
Essentially apply
will iterate through rows and if the sum of x <= 0.05 is greater (or equal) than 3 then will return TRUE otherwise FALSE. We then use this logical vector to subset df1
Data
df1 <- read.table(text = " C1 C2 C3 C4 C5
Gene1 70 54 54 75 75
Gene2 23 18 16 54 15
Gene3 43 93 90 43 92
Gene4 32 50 23 13 45
Gene5 44 53 46 34 47
Gene6 42 34 53 85 43
Gene7 49 55 67 49 89
Gene8 25 45 49 34 35
Gene9 19 16 54 53 94")
df2 <- read.table(text = " C1 C2 C3 C4 C5
Gene1 0.04 0.01 0.01 0.01 0.01
Gene2 0.01 0.01 0.01 0.02 0.01
Gene3 0.01 0.07 0.09 0.01 0.06
Gene4 0.01 0.03 0.06 0.01 0.02
Gene5 0.04 0.01 0.07 0.08 0.01
Gene6 0.09 0.07 0.01 0.06 0.06
Gene7 0.10 0.07 0.01 0.01 0.06
Gene8 0.01 0.01 0.02 0.01 0.01
Gene9 0.09 0.01 0.07 0.08 0.06")
Upvotes: 3