Reputation: 1433
This question is similar to this one asked earlier but not quite. I would like to iterate through a large dataset (~500,000 rows) and for each unique value in one column, I would like to do some processing of all the values in another column.
Here is code that I have confirmed to work:
df = matrix(nrow=783,ncol=2)
counts = table(csvdata$value)
p = (as.vector(counts))/length(csvdata$value)
D = 1 - sum(p**2)
The only problem with it is that it returns the value D
for the entire dataset, rather than returning a separate D
value for each set of rows where ID
is the same.
How would I be able to do the same thing as the code above, but return a D value for each group of rows where ID
is the same, rather than for the entire dataset? I imagine this requires a loop, and creating a matrix to store all the D values in with ID
in one column and the value of D
in the other, but not sure.
Upvotes: 0
Views: 7607
Reputation: 50738
Ok, let's work with "In short, I would like whatever is in the for loop to be executed for each block of data with a unique value of "ID"".
In general you can group rows by values in one column (e.g. "ID"
) and then perform some transformation based on values/entries in other columns per group. In the tidyverse
this would look like this
library(tidyverse)
df %>%
group_by(ID) %>%
mutate(value.mean = mean(value))
## A tibble: 8 x 3
## Groups: ID [3]
# ID value value.mean
# <fct> <int> <dbl>
#1 a 13 12.6
#2 a 14 12.6
#3 a 12 12.6
#4 a 13 12.6
#5 a 11 12.6
#6 b 12 15.5
#7 b 19 15.5
#8 cc4 10 10.0
Here we calculate the mean of value
per group, and add these values to every row. If instead you wanted to summarise values, i.e. keep only the summarised value(s) per group, you would use summarise
instead of mutate
.
library(tidyverse)
df %>%
group_by(ID) %>%
summarise(value.mean = mean(value))
## A tibble: 3 x 2
# ID value.mean
# <fct> <dbl>
#1 a 12.6
#2 b 15.5
#3 cc4 10.0
The same can be achieved in base R using one of tapply
, ave
, by
. As far as I understand your problem statement there is no need for a for
loop. Just apply a function (per group).
df <- read.table(text =
"ID value
a 13
a 14
a 12
a 13
a 11
b 12
b 19
cc4 10", header = T)
To conclude from the comments&chat, this should be what you're after.
# Sample data
set.seed(2017)
csvdata <- data.frame(
microsat = rep(c("A", "B", "C"), each = 8),
allele = sample(20, 3 * 8, replace = T))
csvdata %>%
group_by(microsat) %>%
summarise(D = 1 - sum(prop.table(table(allele))^2))
## A tibble: 3 x 2
# microsat D
# <fct> <dbl>
#1 A 0.844
#2 B 0.812
#3 C 0.812
Note that prop.table
returns fractions and is shorter than your (as.vector(counts))/length(csvdata$value)
. Note also that you can reproduce your results for all values (irrespective of ID
) if you omit the group_by
line.
Upvotes: 2
Reputation: 887951
A base R
option would be
df1$value.mean <- with(df1, ave(value, ID))
Upvotes: 0