Jibril
Jibril

Reputation: 1037

R - Working on Speed / Efficiency - Creating + Updating a Matrix from a large file

Ill start with my intuition - I presume there is some way to do this using "apply" without looping over every single entry in my file, but I cannot figure it out.

I have a very large file. Around 4,000,000 rows, 6 columns. For the sake of this discussion, only 3 of those columns are relevant. They are day, grade1 and grade2

I want to create a matrix with multiple fields, using those 3 columns from my .txt mentioned above.

I want day, grade1Wins, grade2Wins, ties, grade1Score, grade2Score, grade1Avg and grade2Avg. These are all per day.

So my output matrix will have those 8 columns, and rows equal to the number of days that there are.

I define these as follows.

Grade1Wins : Starts at 0, increment by 1 for each row where Grade1 > |Grade2|. Grade2Wins : Starts at 0, increment by 1 for each row where |Grade2| > Grade1. ties : Starts at 0, increment by 1 for each row where Grade1 == |Grade2|

Grade1Score : Summation of raw score values. Grade2Score : Summation of raw score values.

Grade1Avg :Grade1Score/ (numRows) Grade2Avg : Grade2Score / (numRows)

I want this all as a matrix. So, for example, my data may look like....(subset to only show the 2 relevant columns)

Day    Grade1    Grade2
1    2         -4
1    4         -4
2    10        -1

So my output matrix would be

`day` `grade1wins` `grade2wins` `ties` `grade1score` `grade2score` `grade1avg` `grade2avg`
  1        0           1          1         6           -8             2          -4
  2        1           0          0         10          -1             10          -1

Currently, I do this via loops. My matrix is pre-defined (# cols / # rows determined in advance, have a blank matrix full of 0's waiting for editting).

I loop through every row of my tab-delimited .txt file.

I determine the day of the current row. This is my row number. I pull the values from the row for grade1 and grade2.

#Loop over every single row.
for(i in 1:len)
{
            entry = entries[i,]
            rowNum = entry$day
            if( entry$grade1> abs(entry$grade2) )
            {
                mat[rowNum, "grade1wins"] = mat[rowNum, "grade1wins"] + 1 ## Increment the counter 
            } else if( abs(entry$grade2) > entry$grade1 ) {
                mat[rowNum, "grade2wins"] = mat[rowNum, "grade2wins"] + 1 ## Increment the counter 
            } else {
                mat[rowNum, "ties"] = mat[rowNum, "ties"] + 1
            }

            mat[rowNum, "grade1"] = mat[rowNum, "grade1"] + entry$grade1
            mat[rowNum, "grade2"] = mat[rowNum, "grade2"] + entry$grade2
} # end loop, we went through every single entry now
mat[, "PosAvg"] = mat[,"PosScore"] / mat[, "NumTweets"]
mat[, "NegAvg"] = mat[,"NegScore"] / mat[, "NumTweets"]

I presume there is some way to do this using "apply" without looping over every single entry in my file, but I cannot figure it out.

Upvotes: 2

Views: 50

Answers (1)

Kunal Puri
Kunal Puri

Reputation: 3427

You can do this with the help of data.table package.

You can read the dataset with the help of fread() function and save the dataset to a variable, say data2.

Now, you can apply the desired operation by this code:

test_function <- function(dt){
    grade1wins <- length(which(dt$Grade1 > abs(dt$Grade2)))
    grade2wins <- length(which(dt$Grade1 < abs(dt$Grade2)))
    ties <- nrow(dt) - grade1wins - grade2wins
    grade1score <- sum(dt$Grade1)
    grade2score <- sum(dt$Grade2)
    grade1avg <- mean(dt$Grade1)
    grade2avg <- mean(dt$Grade2)

    return (list(grade1wins = grade1wins,grade2wins = grade2wins,
        ties = ties,grade1score = grade1score,grade2score = grade2score,
        grade1avg = grade1avg,grade2avg = grade2avg))
}

> as.matrix(data2[,test_function(.SD),by=Day])
     Day grade1wins grade2wins ties grade1score grade2score grade1avg grade2avg
[1,]   1          0          1    1           6          -8         3        -4
[2,]   2          1          0    0          10          -1        10        -1

Upvotes: 3

Related Questions