How to run a for loop for each group in a dataframe?

Question

This question is similar to this one asked earlier but not quite. I would like to iterate through a large dataset (~500,000 rows) and for each unique value in one column, I would like to do some processing of all the values in another column.

Here is code that I have confirmed to work:

df = matrix(nrow=783,ncol=2)
counts = table(csvdata$value)
p = (as.vector(counts))/length(csvdata$value)
D = 1 - sum(p**2)

The only problem with it is that it returns the value D for the entire dataset, rather than returning a separate D value for each set of rows where ID is the same.

Say I had data like this:

How would I be able to do the same thing as the code above, but return a D value for each group of rows where ID is the same, rather than for the entire dataset? I imagine this requires a loop, and creating a matrix to store all the D values in with ID in one column and the value of D in the other, but not sure.

Maurits Evers · Accepted Answer

Ok, let's work with "In short, I would like whatever is in the for loop to be executed for each block of data with a unique value of "ID"".

In general you can group rows by values in one column (e.g. "ID") and then perform some transformation based on values/entries in other columns per group. In the tidyverse this would look like this

library(tidyverse)
df %>%
    group_by(ID) %>%
    mutate(value.mean = mean(value))
## A tibble: 8 x 3
## Groups:   ID [3]
#  ID    value value.mean
#         
#1 a        13       12.6
#2 a        14       12.6
#3 a        12       12.6
#4 a        13       12.6
#5 a        11       12.6
#6 b        12       15.5
#7 b        19       15.5
#8 cc4      10       10.0

Here we calculate the mean of value per group, and add these values to every row. If instead you wanted to summarise values, i.e. keep only the summarised value(s) per group, you would use summarise instead of mutate.

library(tidyverse)
df %>%
    group_by(ID) %>%
    summarise(value.mean = mean(value))
## A tibble: 3 x 2
#  ID    value.mean
#        
#1 a           12.6
#2 b           15.5
#3 cc4         10.0

The same can be achieved in base R using one of tapply, ave, by. As far as I understand your problem statement there is no need for a for loop. Just apply a function (per group).

Sample data

df <- read.table(text =
    "ID value
a 13
a 14
a 12
a 13
a 11
b 12
b 19
cc4 10", header = T)

Update

To conclude from the comments&chat, this should be what you're after.

# Sample data
set.seed(2017)
csvdata <- data.frame(
    microsat = rep(c("A", "B", "C"), each = 8),
    allele = sample(20, 3 * 8, replace = T))

csvdata %>%
    group_by(microsat) %>%
    summarise(D = 1 - sum(prop.table(table(allele))^2))
## A tibble: 3 x 2
#  microsat     D
#      
#1 A        0.844
#2 B        0.812
#3 C        0.812

Note that prop.table returns fractions and is shorter than your (as.vector(counts))/length(csvdata$value). Note also that you can reproduce your results for all values (irrespective of ID) if you omit the group_by line.

How to run a for loop for each group in a dataframe?

Answers (2)

Sample data

Update

Related Questions