user2542995

Reputation: 241

Multiple loops following string and subsets

I'm currently stuck with my data frame and would like to know how to take "subsets of subsets of subsets". Here is part of my data frame:

  YEAR     RN  DATE     NAME               SITE LONG                                      SP SUMNB     NB100
1 2011 RNN027 15056 ESTAGNOL RNN027-Estagnol 02  310 Anthocharis cardamines (Linnaeus, 1758)     1 0.3225806
2 2011 RNN027 15075 ESTAGNOL RNN027-Estagnol 02  310 Anthocharis cardamines (Linnaeus, 1758)     1 0.3225806
3 2003 RNN027 12166 ESTAGNOL RNN027-Estagnol 03  330 Anthocharis cardamines (Linnaeus, 1758)     2 0.6060606
4 2006 RNN027 13252 ESTAGNOL RNN027-Estagnol 03  330 Anthocharis cardamines (Linnaeus, 1758)     2 0.6060606
5 2006 RNN027 13257 ESTAGNOL RNN027-Estagnol 03  330 Anthocharis cardamines (Linnaeus, 1758)     2 0.6060606
6 2005 RNN027 12895 ESTAGNOL RNN027-Estagnol 01  540 Anthocharis cardamines (Linnaeus, 1758)     2 0.3703704

My goal is to compute an abundance factor for each species. To do that, I have to isolate every count date for every species, every year, and every site.

My first idea was to nest several loops, subsetting at every step by the previous criterion:

DF --> Loop SITE ; subset of each SITE -->loop YEAR; subset of each YEAR -->loop SP; subset of each SPECIES--> dates of observations

Isolating these dates is a step toward further modifications (adding rows), but I need to be able to write the modified subsets back afterwards and rebuild a new data frame.

I built my nested loops:

LOOPSITE<-sort(unique(DF$SITE))
for(i in LOOPSITE){
  print(i)
  LOOPSITESUB<-subset(DF,grepl(i,SITE))
  LOOPYEAR<-sort(unique(LOOPSITESUB$YEAR))
  print(LOOPYEAR)

  for(j in LOOPYEAR){
    print(j)
    LOOPYEARSUB<-subset(LOOPSITESUB,grepl(j,YEAR))
    LOOPSP<-sort(unique(LOOPYEARSUB$SP))
    print(length(LOOPSP))
       for(k in LOOPSP){
         print(k)
         LOOPSPSUB<-subset(LOOPYEARSUB,grepl(k,SP))
         print(sum(LOOPYEARSUB$SUMNB))
         print(head(LOOPSPSUB))
    }      
  }
}

Thanks to all these "print" commands I can follow what the script is doing, and it works until I reach the species subsetting. For some reason, the last subsetting step returns rows for only some of the species. Here is part of the output for the last SITE and the last YEAR:

"RNN027-Estagnol 01"
...(I skipped all the sites)
"RNN027-Estagnol 06"
"2003"
...(I skipped all the years)
"2011"
[1] 22
[1] "Aricia agestis D., 1775"
[1] 107
   YEAR     RN       DATE      NOM               SITE LONG                      SP SUMNB     NB100
66 2011 RNN027 2011-04-21 ESTAGNOL RNN027-Estagnol 06  260 Aricia agestis D., 1775     1 0.3846154
67 2011 RNN027 2011-05-22 ESTAGNOL RNN027-Estagnol 06  260 Aricia agestis D., 1775     1 0.3846154
68 2011 RNN027 2011-08-05 ESTAGNOL RNN027-Estagnol 06  260 Aricia agestis D., 1775     2 0.7692308
[1] "Brintesia circe (Fabricius, 1775)"
[1] 107
[1] YEAR  RN    DATE  NOM   SITE  LONG  SP    SUMNB NB100
<0 rows> (or 0-length row.names)
[1] "Carcharodus alceae (Esper, 1780)"
[1] 107
[1] YEAR  RN    DATE  NOM   SITE  LONG  SP    SUMNB NB100
<0 rows> (or 0-length row.names)

It works for "Aricia agestis D., 1775" but not for "Brintesia circe (Fabricius, 1775)". I checked in my data frame that the second species was observed at this time and place and has the same format as the previous one, so it should work.

How many loops can I nest like this? Is there another way to do this (ideally a more convenient and faster one)? I'm aware of the "split" function, which basically breaks the data frame into groups, but as I can't work with each "chunk", it doesn't fit my task. I may be wrong.

At the last step (after modifying all the subsets), I should be able to write each subset into a new data frame to rebuild a modified version of my input.

I may be going about this in entirely the wrong way! I can provide further explanations if needed.

Thanks for your help!

EDIT:

I'll try to explain what I want to do. In order to calculate my abundance index, I need to add "blank" rows before and after each temporal "session" of observation. Basically, I am trying to obtain a subset for every combination of 3 different factors (SITE, YEAR and SP).

Here is an example of the type of output I would like to obtain, for every possible SITE X/YEAR Y/SP Z combination:

 YEAR    RN      DATE    NAME            SITE           LONG SP  SUMNB NB100
----ADD A NEW ROW----DATE MINUS 7 DAYS-----------------------------------------------------------------------------------
1   Y    RNN027  15056   ESTAGNOL    RNN027-Estagnol X  310 SP Z  1   0.3225806
2   Y    RNN027  15075   ESTAGNOL    RNN027-Estagnol X  310 SP Z  1   0.3225806
3   Y    RNN027  12166   ESTAGNOL    RNN027-Estagnol X  330 SP Z  2   0.6060606
4   Y    RNN027  13252   ESTAGNOL    RNN027-Estagnol X  330 SP Z  2   0.6060606
5   Y    RNN027  13257   ESTAGNOL    RNN027-Estagnol X  330 SP Z  2   0.6060606
6   Y    RNN027  12895   ESTAGNOL    RNN027-Estagnol X  540 SP Z  2   0.3703704
----ADD A NEW ROW----DATE PLUS 7 DAYS-----------------------------------------------------------------------------------

Then I write every modified subset back and combine them into a new DF.

EDIT 2: Using "split(DF, list(DF$SITE, DF$YEAR, DF$SP))" crashed my computer unless I dropped the unused combinations. I got exactly what I want, but how can I access and modify every subset?

Upvotes: 1

Views: 152

Answers (2)

Sven Hohenstein

Reputation: 81733

I suppose you are looking for aggregate.

aggregate(SUMNB ~ SITE + YEAR + SP, DF, sum)

#                 SITE YEAR                                      SP SUMNB
# 1 RNN027-Estagnol 03 2003 Anthocharis cardamines (Linnaeus, 1758)     2
# 2 RNN027-Estagnol 01 2005 Anthocharis cardamines (Linnaeus, 1758)     2
# 3 RNN027-Estagnol 03 2006 Anthocharis cardamines (Linnaeus, 1758)     4
# 4 RNN027-Estagnol 02 2011 Anthocharis cardamines (Linnaeus, 1758)     2

The command calculates the sum of all values in SUMNB for each combination of SITE, YEAR and SP.


Edit

Does the following code produce what you're looking for?

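do.call(rbind, by(DF, DF[c("SITE", "YEAR", "SP")], FUN = function(x) {
  # duplicate the first and the last row of the group
  tmp <- x[c(1, seq(nrow(x)), nrow(x)), ]
  # shift the duplicated rows to 7 days before and 7 days after
  tmp$DATE[1] <- tmp$DATE[1] - 7
  tmp$DATE[nrow(tmp)] <- tmp$DATE[nrow(tmp)] + 7
  return(tmp)
}))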
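
As for the empty subsets in the question's loop, a likely cause is that grepl() treats its first argument as a regular expression. The parentheses in "Brintesia circe (Fabricius, 1775)" then form a regex group instead of matching literal characters, while a name without parentheses such as "Aricia agestis D., 1775" still matches. A minimal sketch of the difference, using a made-up species vector (sp and k are illustrative names only):

```r
# Hypothetical species vector, for illustration only
sp <- c("Aricia agestis D., 1775",
        "Brintesia circe (Fabricius, 1775)",
        "Brintesia circe (Fabricius, 1775)")
k <- "Brintesia circe (Fabricius, 1775)"

regex_hits <- grepl(k, sp)                # pattern is read as a regular expression
fixed_hits <- grepl(k, sp, fixed = TRUE)  # pattern is read literally
exact_hits <- sp == k                     # exact comparison, no substring matching

print(regex_hits)
print(fixed_hits)
print(exact_hits)
```

Inside the loop, subset(LOOPYEARSUB, SP == k) is probably the safer choice, since grepl() also matches substrings even with fixed = TRUE.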

Upvotes: 3

alexis_laz

Reputation: 13122

Based on your edits, I believe this could be useful:

set.seed(11)
DF <- data.frame(YEAR = sample(c(2001, 2003), 5, T),     #random data
                 SITE = sample(c("a", "b"), 5, T),
                 SP = sample(c("sp1", "sp2"), 5, T),
                 DATE = sample(12345:15678, 5))

res <- lapply(split(DF, list(DF$SITE, DF$YEAR, DF$SP)), 
                function(x) 
                {
                 if(nrow(x) > 0)
                  {
                   row1 <- x[1,]
                   names(row1) <- colnames(x)
                   row1["DATE"] <- x$DATE[1] - 7

                   rown <- x[nrow(x),]
                   names(rown) <- colnames(x)
                   rown["DATE"] <- x$DATE[nrow(x)] + 7

                   rbind(row1, x, rown)
                  } 
                })
DF2 <- do.call(rbind, res)
rownames(DF2) = seq_len(nrow(DF2))

DF
#  YEAR SITE  SP  DATE
#1 2001    b sp1 14257
#2 2001    a sp1 13950
#3 2003    a sp2 13446
#4 2001    b sp2 12870
#5 2001    a sp2 13943
DF2
#   YEAR SITE  SP  DATE
#1  2001    a sp1 13943
#2  2001    a sp1 13950
#3  2001    a sp1 13957
#4  2001    b sp1 14250
#5  2001    b sp1 14257
#6  2001    b sp1 14264
#7  2001    a sp2 13936
#8  2001    a sp2 13943
#9  2001    a sp2 13950
#10 2001    b sp2 12863
#11 2001    b sp2 12870
#12 2001    b sp2 12877
#13 2003    a sp2 13439
#14 2003    a sp2 13446
#15 2003    a sp2 13453
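
If the rows within a group are not guaranteed to be sorted by DATE, the padding rows can be derived from the earliest and latest date rather than from the first and last row. The following is only a sketch under that assumption: the helper name pad_group and the toy data are made up for illustration, and drop = TRUE is passed to split() so that SITE/YEAR/SP combinations without any rows are never created.

```r
# Toy data, made up for illustration
DF <- data.frame(YEAR = c(2001, 2001, 2001, 2003),
                 SITE = c("a", "a", "b", "a"),
                 SP   = c("sp1", "sp1", "sp1", "sp2"),
                 DATE = c(13950, 13943, 14257, 13446))

# Hypothetical helper: add one row 7 days before the earliest date
# and one row 7 days after the latest date of a group
pad_group <- function(x) {
  x <- x[order(x$DATE), ]   # sort the group by date first
  before <- x[1, ]
  before$DATE <- before$DATE - 7
  after <- x[nrow(x), ]
  after$DATE <- after$DATE + 7
  rbind(before, x, after)
}

# One list element per SITE/YEAR/SP combination that actually occurs
groups <- split(DF, list(DF$SITE, DF$YEAR, DF$SP), drop = TRUE)
padded <- lapply(groups, pad_group)

# A single subset can be inspected or modified by its name
print(padded[["a.2001.sp1"]])

# Stack the modified subsets back into one data frame
DF2 <- do.call(rbind, padded)
rownames(DF2) <- NULL
```

The added rows are copies of real observations, so the other columns keep their values. If the "blank" rows should carry a zero count, SUMNB would have to be reset inside the helper as well.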

Upvotes: 1
