Reputation: 5567
I have data in the following format called DF (this is just a made up simplified sample):
eval.num, eval.count, fitness, fitness.mean, green.h.0, green.v.0, offset.0, random
1 1 1500 1500 100 120 40 232342
2 2 1000 1250 100 120 40 11843
3 3 1250 1250 100 120 40 981340234
4 4 1000 1187.5 100 120 40 4363453
5 1 2000 2000 200 100 40 345902
6 1 3000 3000 150 90 10 943
7 1 2000 2000 90 90 100 9304358
8 2 1800 1900 90 90 100 284333
However, the eval.count column is incorrect and I need to fix it. For each row, it should report the number of rows with the same values for green.h.0, green.v.0, and offset.0, looking only at rows up to that point and ignoring later ones.
The eval.count values in the example above are the expected ones, but assume the actual values are incorrect.
How can I add a new column (say "count") which will count all previous rows which have the same values of the specified variables?
I got help on a similar problem (selecting all rows with the same values for specified columns), so I suppose I could just write a loop around that, but it seems inefficient to me.
Upvotes: 3
Views: 13091
Reputation: 5567
I have a solution I figured out over time (sorry I haven't checked this in a while):
checkIt <- function(bind) {
  print(bind)
  # TRUE when row r matches row 'bind' in every one of columns 23:47
  cmpfun <- function(r) {all(r == heeds.data[bind,23:47,drop=FALSE])}
  brows <- apply(heeds.data[,23:47], 1, cmpfun)
  #print(heeds.data[brows,c("eval.num","fitness","green.h.1","green.h.2","green.v.5")])
  # number of rows (including 'bind' itself) that match row 'bind'
  print(nrow(heeds.data[brows,c("eval.num","fitness","green.h.1","green.h.2","green.v.5")]))
}
Note that heeds.data is my actual data frame; I originally printed a few columns to make sure it was working (that line is now commented out). Also, 23:47 is the range of columns that needs to be checked for duplicates.
Also, I really haven't learned as much R as I should so I'm open to suggestions.
Hope this helps!
Upvotes: 0
Reputation: 5567
Okay I used the answer I had on another question and worked out a loop that I think will work. This is what I'm going to use:
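# Count the earlier rows whose columns 27:51 match those of row i.
# Works on row indices instead of apply(), which would coerce each row to a vector.
cmpfun2 <- function(i) {
  count <- 0
  for (row in seq_len(i - 1)) {  # empty for i == 1, so no guard is needed
    if (all(DF[i, 27:51] == DF[row, 27:51])) {  # compare row i to an earlier row
      count <- count + 1
    }
  }
  count
}
brows <- sapply(seq_len(nrow(DF)), cmpfun2)
print(brows)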
Please comment if I made a mistake and this won't work, but I think I've figured it out. Thanks!
Upvotes: 1
Reputation: 25327
Ok, let's first do it in the easy case where you just have one column.
> data <- rep(sample(1000, 5), sample(5, 5))
> head(data)
[1] 435 435 435 278 278 278
Then you can just use rle to figure out the contiguous sequences:
> sequence(rle(data)$lengths)
[1] 1 2 3 1 2 3 4 5 1 2 3 4 1 2 1
Or altogether:
> head(cbind(data, sequence(rle(data)$lengths)))
[1,] 435 1
[2,] 435 2
[3,] 435 3
[4,] 278 1
[5,] 278 2
[6,] 278 3
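One caveat: rle only counts runs of adjacent equal values, so a value that reappears later starts again at 1. If repeats may be scattered through the vector, a running count per value can be sketched with ave instead. The vector x and the names run_count and seen_count below are made up for illustration:

```r
# Made-up vector in which 435 reappears after a different value.
x <- c(435, 435, 278, 435, 278)

# rle restarts the count whenever the value changes.
run_count <- sequence(rle(x)$lengths)

# ave numbers the occurrences of each value in order of appearance,
# regardless of where they sit in the vector.
seen_count <- ave(seq_along(x), x, FUN = seq_along)

cbind(x, run_count, seen_count)
```

Both counts include the current element; subtract 1 if only the earlier occurrences should be counted.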
For your case with multiple columns, there are probably a bunch of ways of applying this solution. The easiest might be to just paste the columns you care about together to form a single vector.
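A minimal sketch of the paste idea, using the sample data from the question. The names key and count are made up for illustration, and the rle step assumes that rows with identical values sit next to each other, as they do in the sample:

```r
# Sample data from the question (eval.count, fitness.mean and random are omitted for brevity).
DF <- data.frame(
  eval.num  = 1:8,
  fitness   = c(1500, 1000, 1250, 1000, 2000, 3000, 2000, 1800),
  green.h.0 = c(100, 100, 100, 100, 200, 150, 90, 90),
  green.v.0 = c(120, 120, 120, 120, 100, 90, 90, 90),
  offset.0  = c(40, 40, 40, 40, 40, 10, 100, 100)
)

# Paste the columns of interest into one key per row.
key <- paste(DF$green.h.0, DF$green.v.0, DF$offset.0, sep = "_")

# Number the rows within each contiguous run of identical keys.
DF$count <- sequence(rle(key)$lengths)
DF$count
```

On this sample the new column comes out as 1 2 3 4 1 1 1 2, which matches the expected eval.count values in the question.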
Upvotes: 9