Reputation: 955
I have a vector that contains just over a quarter of a million values (I know, a huge amount), and I need to calculate the difference between each value and every other value. For example, taking the first value, 202.7952, I want to calculate the difference between it and every other value in my vector, discarding differences above 400. Then I want to take the second value (202.7956) and do the same thing (including with the value above). The end result, I hope, will be a list of the calculated differences between the values in my vector. For example:
0.0004
0.0125
0.0136
etc
would be produced by taking the difference between the first value and the next three values in the list; this would continue to the bottom of the list before doing the same thing for the second value. However, as I have a quarter of a million values in my vector, I know there may be some computational problems. I've produced an image to show the distribution of my data.
The values I have range from 200 to 1500, with the vast majority falling within the 200-500 range. I've tried to do this in Java but I run into memory issues, so does anyone know whether it's possible to do this in R, and how I could go about it?
This is my Java code:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class matrixDiff {
    public static void main(String[] args) throws IOException {
        double[] values = new double[271730];
        // Read one value per line; try-with-resources closes the reader.
        try (BufferedReader br = new BufferedReader(new FileReader("file"))) {
            String value = br.readLine();
            for (int i = 0; i < values.length && value != null; i++) {
                values[i] = Double.parseDouble(value);
                value = br.readLine();
            }
        }
        for (int i = 0; i < values.length; i++) {
            double mzValue = values[i];
            System.out.println(mzValue);
            for (int j = 0; j < values.length; j++) {
                double diff = values[j] - mzValue;
                // Keep only differences strictly within +/-400 (both bounds must hold).
                if (diff < 400 && diff > -400) {
                    System.out.println(diff);
                }
            }
        }
    }
}
Thanks
Upvotes: 1
Views: 4684
Reputation: 10401
Here's an example of how you could proceed, using sample data of size 1000.
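# report the memory limit (Windows only)
memory.size(max = NA)
# filter out differences larger than K
K <- 25
v <- rnorm(n = 1000, mean = 200, sd = 10)
# one list element per value in v
diffs <- vector("list", length(v))
for (i in seq_along(v)) {
  d <- v[i] - v
  # keep differences of at most K (use abs(d) <= K to filter by magnitude)
  diffs[[i]] <- d[d <= K]
}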
# Check lengths after filtering
sapply(diffs, length)
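For anyone following along from the Java attempt in the question, the same row-by-row idea can be sketched in Java. This is only an illustration on four made-up values: the class name `FilteredDiffs`, the method `filteredDiffs`, and the sample numbers are hypothetical, and unlike the R loop above it filters on the absolute difference, on the assumption that the question's cutoff is meant to apply in both directions.

```java
import java.util.Arrays;

class FilteredDiffs {
    // For each v[i], keep the differences v[i] - v[j] whose magnitude is at most k.
    // Like the R loop, the self-difference (0) is included in every row.
    static double[][] filteredDiffs(double[] v, double k) {
        double[][] rows = new double[v.length][];
        for (int i = 0; i < v.length; i++) {
            double[] buf = new double[v.length]; // worst case: every difference is kept
            int n = 0;
            for (int j = 0; j < v.length; j++) {
                double d = v[i] - v[j];
                if (Math.abs(d) <= k) {
                    buf[n++] = d;
                }
            }
            rows[i] = Arrays.copyOf(buf, n); // trim to the number actually kept
        }
        return rows;
    }

    public static void main(String[] args) {
        double[] v = {200.0, 210.0, 250.0, 600.0}; // made-up sample values
        double[][] rows = filteredDiffs(v, 25.0);
        // Counterpart of sapply(diffs, length): how many differences survive per value.
        for (int i = 0; i < rows.length; i++) {
            System.out.println(v[i] + " keeps " + rows[i].length + " differences");
        }
    }
}
```

With the full vector you would write each row out as soon as it is computed rather than hold all of `rows` in memory.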
EDIT
I don't know whether you have considered this or have already solved your problem, but one way to deal with that amount of data is to store everything in a database. For instance:
library(RSQLite)
# open (or create) an on-disk SQLite database
con <- dbConnect(SQLite(), "diffs.sqlite")
memory.size(max = NA)
v <- rnorm(n = 100000, mean = 200, sd = 10)
# filter out differences larger than K
K <- 25
for (i in seq_along(v)) {
  diffs <- v[i] - v
  diffs <- diffs[diffs <= K]
  # append this value's differences to the table instead of keeping them in memory
  dbWriteTable(con, "mytable", as.data.frame(diffs), append = TRUE)
}
After that, you can do further work on the stored differences with SQL rather than R functions, which would not create memory problems.
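To put "that amount of data" in rough numbers, here is a back-of-the-envelope sketch in Java (the language of the question's attempt). It assumes the 271730 values from the question's array, 8 bytes per `double`, and no filtering at all, so it is an upper bound rather than a measurement; the class and method names are made up for the example.

```java
class PairEstimate {
    // Number of differences the loop computes: every value against every value
    // (including itself), as in `v[i] - v`.
    static long differences(long n) {
        return n * n;
    }

    // Bytes needed to hold that many raw doubles (no object or list overhead).
    static long bytesAsDoubles(long count) {
        return count * Double.BYTES;
    }

    public static void main(String[] args) {
        long n = 271730L; // array length taken from the question's Java code
        long count = differences(n);
        long bytes = bytesAsDoubles(count);
        System.out.printf("%d differences, about %.0f GB as raw doubles%n", count, bytes / 1e9);
    }
}
```

That is roughly 7.4e10 differences, on the order of 590 GB before any filtering, which is why writing each row to disk as it is produced is more realistic than building one big object.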
Upvotes: 2
Reputation: 1664
Vectors are your friends in R: they are a huge time and memory saver.
Data frame example:
df <- data.frame(x = rnorm(1000000))
df$dif <- df$x - c(NA, df$x[1:(length(df$x)-1)])
There you go: the differences between a million consecutive numbers in the blink of an eye.
Vector example:
x <- rnorm(1000000)
x <- x - c(NA, x[1:(length(x)-1)])
Or even shorter:
x <- rnorm(1000000)
x <- c(NA, diff(x))
To accumulate the differences along the vector, you'll need cumsum():
x <- rnorm(1000000)
x <- cumsum(c(0, diff(x)))
Note the 0 instead of NA: cumsum() would otherwise propagate the NA to every later element.
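For readers coming from the Java code in the question, here is a minimal sketch of what `diff()` and `cumsum()` compute, with hypothetical helper names and a tiny made-up input. It also shows that `cumsum(c(0, diff(x)))` gives each value's difference from the first one.

```java
import java.util.Arrays;

class ConsecutiveDiffs {
    // Like R's diff(x): x[i + 1] - x[i], one element shorter than x.
    static double[] diff(double[] x) {
        double[] d = new double[Math.max(x.length - 1, 0)];
        for (int i = 0; i < d.length; i++) {
            d[i] = x[i + 1] - x[i];
        }
        return d;
    }

    // Like R's cumsum(c(0, d)): running total of d, starting from 0.
    static double[] cumsumFromZero(double[] d) {
        double[] c = new double[d.length + 1];
        for (int i = 0; i < d.length; i++) {
            c[i + 1] = c[i] + d[i];
        }
        return c;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 4.0, 9.0, 16.0}; // made-up input
        double[] d = diff(x);
        System.out.println(Arrays.toString(d));                 // [3.0, 5.0, 7.0]
        System.out.println(Arrays.toString(cumsumFromZero(d))); // [0.0, 3.0, 8.0, 15.0]
    }
}
```

The second line of output equals `x - x[1]` in R terms, i.e. every value minus the first value.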
Upvotes: 3