Reputation: 963
I am having difficulty executing a calculation that is defined iteratively. The following data serves as an example (actual data set much larger):
## DATA ##
# Columns
Individual<-c("A","B","C","D","E","F","G","H1","H2","H3","H4","H5","K1","K2","K3","K4","K5")
P1<-c(0,0,"A",0,"C","C",0, rep("E",5),"H1","H2","H3","H4","H5")
P2<-c(0,0,"B",0,"D", "E",0,rep("G",5),"H1","H2","H3","H4","H5")
# Dataframe
myd<-data.frame(Individual,P1,P2,stringsAsFactors=FALSE)
Individual P1 P2
1 A 0 0
2 B 0 0
3 C A B
4 D 0 0
5 E C D
6 F C E
7 G 0 0
8 H1 E G
9 H2 E G
10 H3 E G
11 H4 E G
12 H5 E G
13 K1 H1 H1
14 K2 H2 H2
15 K3 H3 H3
16 K4 H4 H4
17 K5 H5 H5
The data represents the relationship between and Individual and the two parents, P1
, P2
.
The calculation required, labeled relationA
, represents how much each individual is related to A.
By definition, the relationship between A and A is given a value of 1. The values for all other individuals need to be computed based on the information in the table, as follows:
The value of relationA for an individual should be equal to
1/2 (the value of relationA of P1 of the individual)
+ 1/2 (the value of relationA of P2 of the individual)
FOR EXAMPLE
Individual P1 P2 relationA
1 A 0 0 1
2 B 0 0 0
3 C A B (A = 1 + B = 0)/2 = 0.5
4 D 0 0 0
5 E C D (C= 0.5 + D = 0)/2 = 0.25
6 F C E (C = 0.5 + E = 0.25)/2 = 0.375
The expected output is as the following:
Individual P1 P2 relationA
1 A 0 0 1
2 B 0 0 0
3 C A B 0.5
4 D 0 0 0
5 E C D 0.25
6 F C E 0.375
7 G 0 0 0
8 H1 E G 0.125
9 H2 E G 0.125
10 H3 E G 0.125
11 H4 E G 0.125
12 H5 E G 0.125
13 K1 H1 H1 0.125
14 K2 H2 H2 0.125
15 K3 H3 H3 0.125
16 K4 H4 H4 0.125
17 K5 H5 H5 0.125
My difficulty is in expressing this in a proper manner in R
. Any help would be appreciated.
Upvotes: 4
Views: 1009
Reputation: 58825
You can write a function to compute the value given an individual and (implicitly) the relationships as a simple recursive function.
relationA <- function(ind) {
if(ind == "A") {
1
} else if (ind == "0") {
0
} else {
pts <- myd[myd$Individual == ind,]
(relationA(pts[["P1"]]) + relationA(pts[["P2"]])) / 2
}
}
Simply, if the individual is A, it is 1; if the individual is 0, it is 0; for anything else, recursively call relationA
for each parent (P1
and P2
) corresponding to the individual and add those together and divide by 2. This only works for a single individual at a time:
> relationA("A")
[1] 1
> relationA("F")
[1] 0.375
> relationA("K5")
[1] 0.125
but you can vectorize it over all the individuals relatively easily:
> sapply(myd$Individual, relationA)
A B C D E F G H1 H2 H3 H4 H5 K1
1.000 0.000 0.500 0.000 0.250 0.375 0.000 0.125 0.125 0.125 0.125 0.125 0.125
K2 K3 K4 K5
0.125 0.125 0.125 0.125
and this can be assigned back to myd
with
myd$relationA <- sapply(myd$Individual, relationA)
This is not particularly efficient because it has to compute relationA
over and over again for each case. When it gets to "K5", it calls reationA("H5")
twice, each of which call relationA("E")
and relationA("G")
, and those call relationA("C")
, relationA("D")
, relationA("0")
and relationA("0")
, etc.. That is, no results are cached but rather recomputed each time. For this small of a data set, it doesn't matter because even inefficient is still very fast.
If you want/need to cache the results and use that cache, then you can modify relationA
to do so.
relationAc <- function(ind) {
pts <- myd[myd$Individual == ind,]
if(nrow(pts) == 0 | any(is.na(pts[["relationA"]]))) {
relationA <-
if(ind == "A") {
1
} else if (ind == "0") {
0
} else {
(relationAc(pts[["P1"]]) + relationAc(pts[["P2"]])) / 2
}
myd[myd$Individual == ind, "relationA"] <<- relationA
relationA
} else {
pts[["relationA"]]
}
}
You do then have to initialize the cache:
myd$relationA <- NA_real_
A single call will fill in the needed values, and calling for the entire set of individuals will result in all the values being filled in.
> myd
Individual P1 P2 relationA
1 A 0 0 NA
2 B 0 0 NA
3 C A B NA
4 D 0 0 NA
5 E C D NA
6 F C E NA
7 G 0 0 NA
8 H1 E G NA
9 H2 E G NA
10 H3 E G NA
11 H4 E G NA
12 H5 E G NA
13 K1 H1 H1 NA
14 K2 H2 H2 NA
15 K3 H3 H3 NA
16 K4 H4 H4 NA
17 K5 H5 H5 NA
> relationAc("K5")
[1] 0.125
> myd
Individual P1 P2 relationA
1 A 0 0 1.000
2 B 0 0 0.000
3 C A B 0.500
4 D 0 0 0.000
5 E C D 0.250
6 F C E NA
7 G 0 0 0.000
8 H1 E G NA
9 H2 E G NA
10 H3 E G NA
11 H4 E G NA
12 H5 E G 0.125
13 K1 H1 H1 NA
14 K2 H2 H2 NA
15 K3 H3 H3 NA
16 K4 H4 H4 NA
17 K5 H5 H5 0.125
> sapply(myd$Individual, relationAc)
A B C D E F G H1 H2 H3 H4 H5 K1
1.000 0.000 0.500 0.000 0.250 0.375 0.000 0.125 0.125 0.125 0.125 0.125 0.125
K2 K3 K4 K5
0.125 0.125 0.125 0.125
> myd
Individual P1 P2 relationA
1 A 0 0 1.000
2 B 0 0 0.000
3 C A B 0.500
4 D 0 0 0.000
5 E C D 0.250
6 F C E 0.375
7 G 0 0 0.000
8 H1 E G 0.125
9 H2 E G 0.125
10 H3 E G 0.125
11 H4 E G 0.125
12 H5 E G 0.125
13 K1 H1 H1 0.125
14 K2 H2 H2 0.125
15 K3 H3 H3 0.125
16 K4 H4 H4 0.125
17 K5 H5 H5 0.125
Upvotes: 4
Reputation: 55350
more succinctly, you could use sapply
and rowSums
to do away with the for-loop
into a single line of code:
# Initialize values of relationA
myd$relationA <- 0
myd$relationA[myd$Individual=="A"] <- 1
# Calculate relationA
myd$relationA <- myd$relationA + rowSums(sapply(myd$Individual, function(indiv)
myd$relationA[myd$Individual==indiv]/2 * ((myd$P1==indiv) + (myd$P2==indiv))))
Is what you are looking for something like this?
# Initialize values of relationA
myd$relationA <- 0
myd$relationA[myd$Individual=="A"] <- 1
# Iterate over all Individuals
for (indiv in myd$Individual) {
indiVal <- myd$relationA[myd$Individual==indiv]
# all columns handled at once, thanks to vectorization; no need for myd$P1[i]
myd$relationA <- myd$relationA +
indiVal/2 * ((myd$P1==indiv) + (myd$P2==indiv))
}
Output
myd
Individual P1 P2 relationA
1 A 0 0 1.000
2 B 0 0 0.000
3 C A B 0.500
4 D 0 0 0.000
5 E C D 0.250
6 F C E 0.375
7 G 0 0 0.000
8 H1 E G 0.125
9 H2 E G 0.125
...
Upvotes: 3