shNIL
shNIL

Reputation: 963

loop calculation as it go in r

I am having difficulty executing a calculation that is defined iteratively. The following data serves as an example (actual data set much larger):

## DATA ##
# Columns
   Individual<-c("A","B","C","D","E","F","G","H1","H2","H3","H4","H5","K1","K2","K3","K4","K5")
   P1<-c(0,0,"A",0,"C","C",0, rep("E",5),"H1","H2","H3","H4","H5")
   P2<-c(0,0,"B",0,"D", "E",0,rep("G",5),"H1","H2","H3","H4","H5")
# Dataframe
   myd<-data.frame(Individual,P1,P2,stringsAsFactors=FALSE)


   Individual P1 P2
1           A  0  0
2           B  0  0
3           C  A  B
4           D  0  0
5           E  C  D
6           F  C  E
7           G  0  0
8          H1  E  G
9          H2  E  G
10         H3  E  G
11         H4  E  G
12         H5  E  G
13         K1 H1 H1
14         K2 H2 H2
15         K3 H3 H3
16         K4 H4 H4
17         K5 H5 H5

The data represents the relationship between and Individual and the two parents, P1, P2.

The calculation required, labeled relationA, represents how much each individual is related to A.

By definition, the relationship between A and A is given a value of 1. The values for all other individuals need to be computed based on the information in the table, as follows:

The value of relationA for an individual should be equal to 
   1/2 (the value of relationA of P1 of the individual)  
 + 1/2 (the value of relationA of P2 of the individual)

FOR EXAMPLE

  Individual P1 P2      relationA
1           A  0  0       1
2           B  0  0       0
3           C  A  B       (A = 1 + B = 0)/2 = 0.5
4           D  0  0       0
5           E  C  D       (C= 0.5 + D = 0)/2 = 0.25
6           F  C  E       (C = 0.5 + E = 0.25)/2 = 0.375  

The expected output is as the following:

 Individual P1 P2  relationA
1           A  0  0   1
2           B  0  0   0
3           C  A  B   0.5
4           D  0  0   0
5           E  C  D   0.25
6           F  C  E   0.375
7           G  0  0   0 
8          H1  E  G   0.125
9          H2  E  G   0.125
10         H3  E  G   0.125
11         H4  E  G   0.125
12         H5  E  G   0.125
13         K1 H1 H1   0.125
14         K2 H2 H2   0.125
15         K3 H3 H3   0.125
16         K4 H4 H4   0.125
17         K5 H5 H5   0.125

My difficulty is in expressing this in a proper manner in R. Any help would be appreciated.

Upvotes: 4

Views: 1009

Answers (2)

Brian Diggs
Brian Diggs

Reputation: 58825

You can write a function to compute the value given an individual and (implicitly) the relationships as a simple recursive function.

relationA <- function(ind) {
  if(ind == "A") {
    1
  } else if (ind == "0") {
    0
  } else {
    pts <- myd[myd$Individual == ind,]
    (relationA(pts[["P1"]]) + relationA(pts[["P2"]])) / 2
  }
}

Simply, if the individual is A, it is 1; if the individual is 0, it is 0; for anything else, recursively call relationA for each parent (P1 and P2) corresponding to the individual and add those together and divide by 2. This only works for a single individual at a time:

> relationA("A")
[1] 1
> relationA("F")
[1] 0.375
> relationA("K5")
[1] 0.125

but you can vectorize it over all the individuals relatively easily:

> sapply(myd$Individual, relationA)
    A     B     C     D     E     F     G    H1    H2    H3    H4    H5    K1 
1.000 0.000 0.500 0.000 0.250 0.375 0.000 0.125 0.125 0.125 0.125 0.125 0.125 
   K2    K3    K4    K5 
0.125 0.125 0.125 0.125 

and this can be assigned back to myd with

myd$relationA <- sapply(myd$Individual, relationA)

This is not particularly efficient because it has to compute relationA over and over again for each case. When it gets to "K5", it calls reationA("H5") twice, each of which call relationA("E") and relationA("G"), and those call relationA("C"), relationA("D"), relationA("0") and relationA("0"), etc.. That is, no results are cached but rather recomputed each time. For this small of a data set, it doesn't matter because even inefficient is still very fast.

If you want/need to cache the results and use that cache, then you can modify relationA to do so.

relationAc <- function(ind) {
  pts <- myd[myd$Individual == ind,]
  if(nrow(pts) == 0 | any(is.na(pts[["relationA"]]))) {
    relationA <-
      if(ind == "A") {
        1
      } else if (ind == "0") {
        0
      } else {
        (relationAc(pts[["P1"]]) + relationAc(pts[["P2"]])) / 2
      }
    myd[myd$Individual == ind, "relationA"] <<- relationA
    relationA
  } else {
    pts[["relationA"]]
  }
}

You do then have to initialize the cache:

myd$relationA <- NA_real_

A single call will fill in the needed values, and calling for the entire set of individuals will result in all the values being filled in.

> myd
   Individual P1 P2 relationA
1           A  0  0        NA
2           B  0  0        NA
3           C  A  B        NA
4           D  0  0        NA
5           E  C  D        NA
6           F  C  E        NA
7           G  0  0        NA
8          H1  E  G        NA
9          H2  E  G        NA
10         H3  E  G        NA
11         H4  E  G        NA
12         H5  E  G        NA
13         K1 H1 H1        NA
14         K2 H2 H2        NA
15         K3 H3 H3        NA
16         K4 H4 H4        NA
17         K5 H5 H5        NA
> relationAc("K5")
[1] 0.125
> myd
   Individual P1 P2 relationA
1           A  0  0     1.000
2           B  0  0     0.000
3           C  A  B     0.500
4           D  0  0     0.000
5           E  C  D     0.250
6           F  C  E        NA
7           G  0  0     0.000
8          H1  E  G        NA
9          H2  E  G        NA
10         H3  E  G        NA
11         H4  E  G        NA
12         H5  E  G     0.125
13         K1 H1 H1        NA
14         K2 H2 H2        NA
15         K3 H3 H3        NA
16         K4 H4 H4        NA
17         K5 H5 H5     0.125
> sapply(myd$Individual, relationAc)
    A     B     C     D     E     F     G    H1    H2    H3    H4    H5    K1 
1.000 0.000 0.500 0.000 0.250 0.375 0.000 0.125 0.125 0.125 0.125 0.125 0.125 
   K2    K3    K4    K5 
0.125 0.125 0.125 0.125 
> myd
   Individual P1 P2 relationA
1           A  0  0     1.000
2           B  0  0     0.000
3           C  A  B     0.500
4           D  0  0     0.000
5           E  C  D     0.250
6           F  C  E     0.375
7           G  0  0     0.000
8          H1  E  G     0.125
9          H2  E  G     0.125
10         H3  E  G     0.125
11         H4  E  G     0.125
12         H5  E  G     0.125
13         K1 H1 H1     0.125
14         K2 H2 H2     0.125
15         K3 H3 H3     0.125
16         K4 H4 H4     0.125
17         K5 H5 H5     0.125

Upvotes: 4

Ricardo Saporta
Ricardo Saporta

Reputation: 55350

Edit:

more succinctly, you could use sapply and rowSums to do away with the for-loop into a single line of code:

# Initialize values of relationA
myd$relationA <- 0
myd$relationA[myd$Individual=="A"] <- 1

# Calculate relationA
myd$relationA <-   myd$relationA + rowSums(sapply(myd$Individual, function(indiv) 
     myd$relationA[myd$Individual==indiv]/2 * ((myd$P1==indiv) + (myd$P2==indiv))))



Is what you are looking for something like this?

# Initialize values of relationA
myd$relationA <- 0
myd$relationA[myd$Individual=="A"] <- 1


# Iterate over all Individuals
for (indiv in myd$Individual) {

  indiVal <- myd$relationA[myd$Individual==indiv]

  # all columns handled at once, thanks to vectorization;  no need for myd$P1[i]
  myd$relationA <- myd$relationA  + 
                 indiVal/2 * ((myd$P1==indiv) + (myd$P2==indiv))
}

Output

myd
   Individual P1 P2 relationA
1           A  0  0     1.000
2           B  0  0     0.000
3           C  A  B     0.500
4           D  0  0     0.000
5           E  C  D     0.250
6           F  C  E     0.375
7           G  0  0     0.000
8          H1  E  G     0.125
9          H2  E  G     0.125
...

Upvotes: 3

Related Questions