Reputation: 1037
Ill start with my intuition - I presume there is some way to do this using "apply" without looping over every single entry in my file, but I cannot figure it out.
I have a very large file. Around 4,000,000 rows, 6 columns. For the sake of this discussion, only 3 of those columns are relevant. They are day
, grade1
and grade2
I want to create a matrix with multiple fields, using those 3 columns from my .txt mentioned above.
I want day
, grade1Wins
, grade2Wins
, ties
, grade1Score
, grade2Score
, grade1Avg
and grade2Avg
. These are all per day.
So my output matrix will have those 8 columns, and rows equal to the number of days that there are.
I define these as follows.
Grade1Wins
: Starts at 0, increment by 1 for each row where Grade1
> |Grade2|
.
Grade2Wins
: Starts at 0, increment by 1 for each row where |Grade2|
> Grade1
.
ties
: Starts at 0, increment by 1 for each row where Grade1
== |Grade2|
Grade1Score
: Summation of raw score values.
Grade2Score
: Summation of raw score values.
Grade1Avg :
Grade1Score/ (numRows)
Grade2Avg : Grade2Score
/ (numRows)
I want this all as a matrix. So, for example, my data may look like....(subset to only show the 2 relevant columns)
Day Grade1 Grade2
1 2 -4
1 4 -4
2 10 -1
So my output matrix would be
`day` `grade1wins` `grade2wins` `ties` `grade1score` `grade2score` `grade1avg` `grade2avg`
1 0 1 1 6 -8 2 -4
2 1 0 0 10 -1 10 -1
Currently, I do this via loops. My matrix is pre-defined (# cols / # rows determined in advance, have a blank matrix full of 0's waiting for editting).
I loop through every row of my tab-delimited .txt file.
I determine the day of the current row. This is my row number.
I pull the values from the row for grade1
and grade2
.
#Loop over every single row.
for(i in 1:len)
{
entry = entries[i,]
rowNum = entry$day
if( entry$grade1> abs(entry$grade2) )
{
mat[rowNum, "grade1wins"] = mat[rowNum, "grade1wins"] + 1 ## Increment the counter
} else if( abs(entry$grade2) > entry$grade1 ) {
mat[rowNum, "grade2wins"] = mat[rowNum, "grade2wins"] + 1 ## Increment the counter
} else {
mat[rowNum, "ties"] = mat[rowNum, "ties"] + 1
}
mat[rowNum, "grade1"] = mat[rowNum, "grade1"] + entry$grade1
mat[rowNum, "grade2"] = mat[rowNum, "grade2"] + entry$grade2
} # end loop, we went through every single entry now
mat[, "PosAvg"] = mat[,"PosScore"] / mat[, "NumTweets"]
mat[, "NegAvg"] = mat[,"NegScore"] / mat[, "NumTweets"]
I presume there is some way to do this using "apply" without looping over every single entry in my file, but I cannot figure it out.
Upvotes: 2
Views: 50
Reputation: 3427
You can do this with the help of data.table
package.
You can read the dataset with the help of fread()
function and save the dataset to a variable, say data2.
Now, you can apply the desired operation by this code:
test_function <- function(dt){
grade1wins <- length(which(dt$Grade1 > abs(dt$Grade2)))
grade2wins <- length(which(dt$Grade1 < abs(dt$Grade2)))
ties <- nrow(dt) - grade1wins - grade2wins
grade1score <- sum(dt$Grade1)
grade2score <- sum(dt$Grade2)
grade1avg <- mean(dt$Grade1)
grade2avg <- mean(dt$Grade2)
return (list(grade1wins = grade1wins,grade2wins = grade2wins,
ties = ties,grade1score = grade1score,grade2score = grade2score,
grade1avg = grade1avg,grade2avg = grade2avg))
}
> as.matrix(data2[,test_function(.SD),by=Day])
Day grade1wins grade2wins ties grade1score grade2score grade1avg grade2avg
[1,] 1 0 1 1 6 -8 3 -4
[2,] 2 1 0 0 10 -1 10 -1
Upvotes: 3