Reputation: 11

Using R to remove data which is below a quartile threshold

I am creating correlations using R, with the following code:

Values<-read.csv(inputFile, header = TRUE)
O<-Values$Abundance_O
S<-Values$Abundance_S
cor(O,S)
pear_cor<-round(cor(O,S),4)
outfile<-paste(inputFile, ".jpg", sep = "")
jpeg(filename = outfile, width = 15, height = 10, units = "in", pointsize = 10, quality = 75, bg = "white", res = 300, restoreConsole = TRUE)
rx<-range(0,20000000)
ry<-range(0,200000)
plot(rx,ry, ylab="S", xlab="O", main="O vs S", type="n")
points(O,S, col="black", pch=3, lwd=1)
mtext(sprintf("%s %.4f", "pearson: ", pear_cor), adj=1, padj=0, side = 1, line = 4)
dev.off()
pear_cor

I now need to find the lower quartile for each set of data and exclude data that is within the lower quartile. I would then like to rewrite the data without those values and use the new column of data in the correlation analysis (because I want to threshold the data by the lower quartile). If there is a way I can write this so that it is easy to change the threshold by applying arguments from Java (as I have with the input file name) that's even better!

Thank you so much.

I have now implemented the answer below and it works, but I need to keep the pairs of data together for the correlation. Here is an example of my data (from the CSV):

Abundance_O Abundance_S
3635900.752 1390.883073
463299.4622 1470.92626
359101.0482 989.1609251
284966.6421 3248.832403
415283.663  2492.231265
2076456.856 10175.48946
620286.6206 5074.268802
3709754.717 269.6856808
803321.0892 118.2935093
411553.0203 4772.499758
50626.83554 17.29893001
337428.8939 203.3536852
42046.61549 152.1321255
1372013.047 5436.783169
939106.3275 7080.770535
96618.01393 1967.834701
229045.6983 948.3087208
4419414.018 23735.19352

So I need to exclude both values in a row if either one does not meet my threshold (the 0.25 quantile). For example, if the lower quartile for O were 45000, the row "42046.61549,152.1321255" would be removed. Is this possible? If I read in both columns as a data frame, can I test each column separately? Or should I find the quartiles first and then put those values into the code that removes the appropriate rows?

Thanks again, and sorry for the evolution of the question!

Upvotes: 1

Answers (2)

JDG

Reputation: 1364

Although this is an old question, I came across it during research of my own and I arrived at a solution that someone may be interested in.

I first defined a function that converts a numeric vector into its quantile groups. The parameter n sets the number of groups (n = 4 for quartiles, n = 10 for deciles).

qgroup = function(numvec, n = 4){

  qtile = quantile(numvec, probs = seq(0, 1, 1/n))
  out = sapply(numvec, function(x) sum(x >= qtile[-(n+1)]))

  return(out)
}

Function example:

v = rep(1:20)

> qgroup(v)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
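
As a quick check of the n parameter, here is a minimal sketch that calls the same function with n = 10 and tabulates the group sizes. The function body is repeated from above so the snippet runs on its own; the input 1:20 is only an illustration.

```r
# qgroup() as defined above, repeated so this snippet is self-contained
qgroup <- function(numvec, n = 4) {
  # n + 1 cut points, from the minimum (0%) to the maximum (100%)
  qtile <- quantile(numvec, probs = seq(0, 1, 1 / n))
  # A value's group is the number of lower cut points it reaches
  sapply(numvec, function(x) sum(x >= qtile[-(n + 1)]))
}

v <- 1:20
deciles <- qgroup(v, n = 10)

# Count how many values fall into each decile group
table(deciles)
```

Because the last cut point (the maximum) is dropped before counting, the largest value lands in group n rather than in a group of its own.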

Consider now the following data:

library(data.table)

dt = data.table(
  A0 = runif(100),
  A1 = runif(100)
)

We apply qgroup() across the data to obtain two quartile group columns:

cols = colnames(dt)
qcols = c('Q0', 'Q1')

dt[, (qcols) := lapply(.SD, qgroup), .SDcols = cols]

head(dt)

           A0        A1 Q0 Q1
1: 0.72121846 0.1908863  3  1
2: 0.70373594 0.4389152  3  2
3: 0.04604934 0.5301261  1  3
4: 0.10476643 0.1108709  1  1
5: 0.76907762 0.4913463  4  2
6: 0.38265848 0.9291649  2  4

Lastly, we only include rows for which both quartile groups are above the first quartile:

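# Test each column separately: Q0 + Q1 > 2 would also keep rows
# where only one of the two groups is 1
dt = dt[Q0 > 1 & Q1 > 1]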
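
For readers without data.table, the same steps can be sketched in base R. This is an illustration under a few assumptions, not part of the original approach: the seed is arbitrary, the column names mirror the example above, and `kept` is a name introduced here.

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable

# qgroup() repeated from above so the snippet is self-contained
qgroup <- function(numvec, n = 4) {
  qtile <- quantile(numvec, probs = seq(0, 1, 1 / n))
  sapply(numvec, function(x) sum(x >= qtile[-(n + 1)]))
}

df <- data.frame(A0 = runif(100), A1 = runif(100))

# Quartile group of each value within its own column
df$Q0 <- qgroup(df$A0)
df$Q1 <- qgroup(df$A1)

# Keep a row only when neither column falls in its bottom quartile group
kept <- df[df$Q0 > 1 & df$Q1 > 1, ]

nrow(kept)
```

The condition tests each column separately. A test on the sum of the two groups would also keep rows where only one column is in the bottom group.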

Upvotes: 0

Simon O'Hanlon

Reputation: 59970

Please try to provide a reproducible example, but if you have data in a data.frame, you can subset it using the quantile function as the logical test. For instance, in the following data we want to select only rows from the dataframe where the value of the measured variable 'Val' is above the bottom quartile:

# set.seed so you can reproduce these values exactly on your system
set.seed(39856)
df <- data.frame( ID = 1:10 , Val = runif(10) )
df
   ID        Val
1   1 0.76487516
2   2 0.59755578
3   3 0.94584374
4   4 0.72179297
5   5 0.04513418
6   6 0.95772248
7   7 0.14566118
8   8 0.84898704
9   9 0.07246594
10 10 0.14136138

# Now select only rows where the value of our measured variable 'Val' is above the bottom quartile (the 25% quantile)
df[ df$Val > quantile(df$Val , 0.25 ) , ]
  ID       Val
1  1 0.7648752
2  2 0.5975558
3  3 0.9458437
4  4 0.7217930
6  6 0.9577225
7  7 0.1456612
8  8 0.8489870

# And check the value of the bottom 25% quantile...
quantile(df$Val , 0.25 )
      25% 
0.1424363
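
The same logical test extends to the paired data in the question: combine one test per column with `&`, so a row is dropped as soon as either value is at or below its own column's threshold. The sketch below uses the sample rows from the question. `filter_above_quantile` and `kept` are hypothetical names introduced here, and the code assumes there are no missing values (`quantile()` stops on `NA` unless `na.rm = TRUE` is passed).

```r
# Sample rows copied from the question
Values <- data.frame(
  Abundance_O = c(3635900.752, 463299.4622, 359101.0482, 284966.6421,
                  415283.663, 2076456.856, 620286.6206, 3709754.717,
                  803321.0892, 411553.0203, 50626.83554, 337428.8939,
                  42046.61549, 1372013.047, 939106.3275, 96618.01393,
                  229045.6983, 4419414.018),
  Abundance_S = c(1390.883073, 1470.92626, 989.1609251, 3248.832403,
                  2492.231265, 10175.48946, 5074.268802, 269.6856808,
                  118.2935093, 4772.499758, 17.29893001, 203.3536852,
                  152.1321255, 5436.783169, 7080.770535, 1967.834701,
                  948.3087208, 23735.19352)
)

# Hypothetical helper: keep rows in which every listed column is
# strictly above that column's own `prob` quantile
filter_above_quantile <- function(df, cols, prob = 0.25) {
  keep <- rep(TRUE, nrow(df))
  for (col in cols) {
    keep <- keep & df[[col]] > quantile(df[[col]], prob)
  }
  df[keep, , drop = FALSE]
}

kept <- filter_above_quantile(Values, c("Abundance_O", "Abundance_S"),
                              prob = 0.25)

# Correlation on the pairs that survive the threshold
cor(kept$Abundance_O, kept$Abundance_S)
```

Since the threshold is an ordinary argument, it can be changed per run. If the script is launched with Rscript, one option is to read the value from `commandArgs(trailingOnly = TRUE)`, though how the calling Java code passes its arguments is an assumption here.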

Upvotes: 6
