user113156
user113156

Reputation: 7107

for loop across two ID columns

I am trying to create a basic for loop function to print the summary statistics of some data. I have two ID variables

k <- unique(df$ID_1)
m <- unique(df$ID_2)

func <- 
  for (i in 1:k) {
  for (j in 1:m) {
    tmp[i] <- summary([[i]])
    print(tmp[i])
    }
}

I am having trouble with the j in 1:m part of the loop and how to handle it to print the summary statistics.

I have:

for i in 1:k(countries) for j in 1:m(industries) print summary statistics for each i across j essentially saving it in a new dataframe.

Just a little unsure how to handle the j part in this for loop.

Here is the first 40 observations of a random sample of the data:

df = structure(list(Major.sectors.id = c(13L, 6L, 1L, 7L, 7L, 5L,  11L, 5L, 5L, 13L, 7L, 5L, 5L, 7L, 7L, 11L, 7L, 5L, 3L, 7L, 5L,  2L, 5L, 7L, 5L, 16L, 16L, 3L, 5L, 8L, 5L, 16L, 5L, 3L, 11L, 2L,  7L, 7L, 5L, 5L), Region.in.country.id = c(14L, 2L, 2L, 1L, 4L,  2L, 8L, 2L, 2L, 11L, 4L, 10L, 1L, 1L, 8L, 11L, 11L, 5L, 9L, 15L,  1L, 3L, 1L, 2L, 1L, 1L, 6L, 14L, 1L, 1L, 10L, 8L, 3L, 15L, 10L,  12L, 5L, 2L, 1L, 10L), CL.CA
= c(0.760887885213847, 1.65052687741763, 
0.0806197205130879, 0.81806086270666, 0.288078797135706, 4.40245492070389, 
1.04708699122107, 0.303834508544838, 0.660743728653869, 0.588710351051974, 
0.798830178759201, 0.927690515241424, 0.425398597752096, 3.94063680888526, 
0.23268050036403, 0.0476527647463019, 0.793725271068126, 7.57011718390699, 
0.313488176978694, 0.932388646476865, 0.528403681370993, 0.327018095170076, 
0.026490765171504, 0.0684209603311003, 0.199002799160252, 1.39359718123149, 
1.46636993076162, 1.4949863023741, 0.950167224080268, 0.976629282263177, 
0.0311026456992869, 0.850753860127157, 1.13918820019141, 0.801954640022603, 
71.4120603015075, 1.81240164673211, 0.813709616403117, 1.16837699936937, 
0.535646685572253, 0.54046260697011), WC.TA = c(0.912151080743853, 
-0.149811669390946, 0, 0.317269061162005, 0.404991527614234, 
0.0917180013689254, 0, 0.0363331984252548, 0.0581172322430804, 
0.0984968038873609, 0.404607993560882, 0.429639957001037, 0.0310040298241571, 
0.51689621232918, 0.20287556599724, 0.0768790516016589, 0.470663333767837, 
0.115332950488128, 0.959409600460894, 0.810159261144644, -0.000968329915698337, 
0.276768734669129, -0.00197889182058048, 0.841627179101607, 0.337732164517433, 
0.0312224689196786, 0.0905910619894282, 0.554192762156444, 0.0112172829248027, 
0.365589304257529, 0.155960321393463, 0.781180744777475, 0.129165904803528, 
0.815471911266304, -0.0833032382328157, 0.312503120641987, 0.637601499063086, 
0.482001582278481, 0.344956547341399, 0.142965448177925)), .Names = c("Major.sectors.id",  "Region.in.country.id", "CL.CA", "WC.TA"), row.names = c(102291L,  68029L, 67292L, 101860L, 94123L, 77664L, 18713L, 39361L, 61163L,  96185L, 92109L, 5411L, 52390L, 95008L, 12084L, 82522L, 102897L,  61661L, 33672L, 89727L, 76163L, 25942L, 10525L, 38400L, 24587L,  33894L, 20545L, 20563L, 10163L, 55586L, 77839L, 59394L, 86677L,  43471L, 77523L, 2190L, 61390L, 77768L, 95584L, 70096L), class = "data.frame")

EDIT @nate.edwinton : for comment below;

Using the dputoutput in the original post I run the following code:

k <- unique(df$Major.sectors.id)
 m <- unique(df$Region.in.country.id)

    colnames(df)
     for(i in 1:length(m)){
       country.ID <- m[i]    # provided ID_2 corresponds to countries
       for(j in 1:length(k)){
         sector.ID <- k[j]   # provided ID_1 corresponds to sectors
         S1 <- which(df$Major.sectors.id == sector.ID)
         S2 <- which(df$Region.in.country.id == country.ID)
         rows.2.consider <- intersect(S1, S2)
         tmp.summary <- summary(df[rows.2.consider,])
         print(tmp.summary)
         rm(country.ID, sector.ID, S1, S2, rows.2.consider, tmp.summary, i, j)
       }
     }
     `

Upvotes: 0

Views: 88

Answers (1)

niko
niko

Reputation: 5281

So this might help you

for(i in 1:length(m)){
  country.ID <- m[i]    # provided ID_2 corresponds to countries
  for(j in 1:length(k)){
    sector.ID <- k[j]   # provided ID_1 corresponds to sectors
    S1 <- which(df$ID_1 == sector.ID)
    S2 <- which(df$ID_2 == country.ID)
    rows.2.consider <- intersect(S1, S2)
    tmp.summary <- summary(df[rows.2.consider,])
    print(tmp.summary)
    rm(sector.ID, S1, S2, rows.2.consider, tmp.summary, j)
    }
   rm(country.ID, i)
 }

Update

Here a portion of the output:

 Major.sectors.id Region.in.country.id     CL.CA            WC.TA       
 Min.   :13       Min.   :14           Min.   :0.7609   Min.   :0.9122  
 1st Qu.:13       1st Qu.:14           1st Qu.:0.7609   1st Qu.:0.9122  
 Median :13       Median :14           Median :0.7609   Median :0.9122  
 Mean   :13       Mean   :14           Mean   :0.7609   Mean   :0.9122  
 3rd Qu.:13       3rd Qu.:14           3rd Qu.:0.7609   3rd Qu.:0.9122  
 Max.   :13       Max.   :14           Max.   :0.7609   Max.   :0.9122  
 Major.sectors.id Region.in.country.id     CL.CA         WC.TA    
 Min.   : NA      Min.   : NA          Min.   : NA   Min.   : NA  
 1st Qu.: NA      1st Qu.: NA          1st Qu.: NA   1st Qu.: NA  
 Median : NA      Median : NA          Median : NA   Median : NA  
 Mean   :NaN      Mean   :NaN          Mean   :NaN   Mean   :NaN  
 3rd Qu.: NA      3rd Qu.: NA          3rd Qu.: NA   3rd Qu.: NA  
 Max.   : NA      Max.   : NA          Max.   : NA   Max.   : NA

The way I see it, the explanation for all these NAs and NaNs is fairly simple: The loop runs through all possible (country.id, region.id) combinations. However, for some combinations there is no data, e.g. there is a row in df corresponding to sector.id == 13 and region.id == 14, however there is no row with sector.id == 6 and region.id == 14. This explains the NAs in the second summary above.

I am still a bit puzzled by the NaN in Mean (I'm asking myself why not NA as in Median), but I guess it has something to do with the way the function is defined.

Finally, the reason sum(is.na(df)) returns 0 is that there is no NA in df! Indeed, which(is.na(df)==TRUE) returns integer(0). So in other words, you are summing up over nothing. is.na(df) is a logical vector consisting of TRUEs or FALSEs. Here, because there is no NA in df, it consists only of FALSEs. Now, these logical operators can be used to perform mathematical operations: TRUE is then seen as 1 and FALSE as 0. This explains it. And if one tries to sum up NAs, one obtains the following: sum(c(NA,NA)) returns NA.

Upvotes: 1

Related Questions