Matt

Reputation: 5567

R: counting the occurrences of similar rows in a data frame

I have a data frame called DF in the following format (this is just a made-up, simplified sample):

eval.num  eval.count  fitness  fitness.mean  green.h.0  green.v.0  offset.0  random
1         1           1500     1500          100        120        40        232342
2         2           1000     1250          100        120        40        11843
3         3           1250     1250          100        120        40        981340234
4         4           1000     1187.5        100        120        40        4363453
5         1           2000     2000          200        100        40        345902
6         1           3000     3000          150        90         10        943
7         1           2000     2000          90         90         100       9304358
8         2           1800     1900          90         90         100       284333

However, the eval.count column is incorrect and I need to fix it. For each row, it should report how many rows have the same values for green.h.0, green.v.0, and offset.0, counting only that row and the rows before it.

The example above uses the expected values, but assume they are incorrect.

How can I add a new column (say "count") which will count all previous rows which have the same values of the specified variables?

I have gotten help on a similar problem of selecting all rows with the same values for specified columns, so I suppose I could just write a loop around that, but it seems inefficient to me.

Upvotes: 3

Views: 13091

Answers (3)

Matt

Reputation: 5567

I have a solution that I figured out over time (sorry, I haven't checked this in a while):

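checkIt <- function(bind) {
    # bind: index of the reference row in heeds.data
    print(bind)

    # TRUE for each row whose columns 23:47 all match the reference row
    cmpfun <- function(r) {all(r == heeds.data[bind, 23:47, drop = FALSE])}
    brows <- apply(heeds.data[, 23:47], 1, cmpfun)

    #print(heeds.data[brows,c("eval.num","fitness","green.h.1","green.h.2","green.v.5")])
    # number of matching rows (the reference row itself is included)
    print(nrow(heeds.data[brows, c("eval.num","fitness","green.h.1","green.h.2","green.v.5")]))
}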

Note that heeds.data is my actual data frame. I originally printed a few columns to make sure that it was working (that line is now commented out). Columns 23:47 are the ones that need to be checked for duplicates.

Also, I really haven't learned as much R as I should so I'm open to suggestions.

Hope this helps!

Upvotes: 0

Matt

Reputation: 5567

Okay, I used the answer I had on another question and worked out a loop that I think will work. This is what I'm going to use:

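cmpfun2 <- function(r) {
    # r[1] is eval.num, assumed to equal the row's position in DF
    count <- 0
    if (r[1] > 1) {
        for (row in 1:(r[1] - 1)) {
            # compare columns 27:51 of this row with an earlier row
            if (all(r[27:51] == DF[row, 27:51, drop = FALSE])) {
                count <- count + 1
            }
        }
    }
    # number of strictly earlier matching rows (0 for a first occurrence)
    return(count)
}
brows <- apply(DF, 1, cmpfun2)
print(brows)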

Please comment if I made a mistake and this won't work, but I think I've figured it out. Thanks!

Upvotes: 1

Jonathan Chang

Reputation: 25327

Ok, let's first do it in the easy case where you just have one column.

> data <- rep(sample(1000, 5),
              sample(5, 5))
> head(data)
[1] 435 435 435 278 278 278

Then you can just use rle to figure out the contiguous sequences:

> sequence(rle(data)$lengths)
[1] 1 2 3 1 2 3 4 5 1 2 3 4 1 2 1

Or altogether:

> head(cbind(data, sequence(rle(data)$lengths)))
[1,]  435 1
[2,]  435 2
[3,]  435 3
[4,]  278 1
[5,]  278 2
[6,]  278 3

For your case with multiple columns, there are probably a bunch of ways of applying this solution. Easiest might be to just paste the columns you care about together to form a single vector.
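As a rough sketch of that paste idea, assuming the sample data from the question (only the relevant columns are rebuilt here, and the names key, count and count.any are made up for illustration):

```r
# Sample data from the question (only the columns needed here).
DF <- data.frame(
  eval.num  = 1:8,
  green.h.0 = c(100, 100, 100, 100, 200, 150, 90, 90),
  green.v.0 = c(120, 120, 120, 120, 100, 90, 90, 90),
  offset.0  = c(40, 40, 40, 40, 40, 10, 100, 100)
)

# Paste the columns of interest into one key per row.
key <- paste(DF$green.h.0, DF$green.v.0, DF$offset.0, sep = "_")

# Running count within each contiguous run of identical keys.
DF$count <- sequence(rle(key)$lengths)

# Variant (an assumption beyond the rle approach): a running count per key
# that also works when matching rows are not adjacent.
DF$count.any <- ave(seq_along(key), key, FUN = seq_along)

print(DF)
```

On this sample both columns come out as 1 2 3 4 1 1 1 2, which matches the expected eval.count. Note that rle only sees contiguous runs, so if identical rows can be separated by other rows, the ave variant is probably the one you want.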

Upvotes: 9
