R - select years within a data frame based on calculated values from another columns

Question

I'm rather stuck. I need to perform some calculations on selected years, which are different for different locations based on additional calculations. I have a data frame that looks like the following sample df:

t0 <- 2017
tf <- 2046
pH_drift <- 0.5
fert_lim <- (pH_drift/0.1)*11.21 # amount of fertilizer needed to decrease pH level by the target drift threshold (i.e. pH_drift)

state<-c(rep("IL",90),rep("IN",90))
county <- c(rep("a",30),rep("b",30),rep("c",30),rep("d",30),rep("e",30),rep("f",30))
for (i in length(county)) {year <- c(2017:2046)}
nfert <- c(sample(20,90,replace = TRUE),sample(13,90,replace = TRUE))
sand <- c(sample(100,90,replace = TRUE),sample(75,90,replace=TRUE))

This creates a data frame with the following format: "state", "county", "nitrogen fertilization", and "sand", from year 2017 to 2046

e.g.

    state county year nfert sand 
1      IL      a 2017    11   99      
2      IL      a 2018     8   84      
3      IL      a 2019     7   43      
4      IL      a 2020     7   57      
5      IL      a 2021    20   15

I need to identify years to apply lime based on mean fertilizer application (nfert) per state. That is, each state will have an average fertilizer load, and using the following formula (fert_lim/n_load) will give me the number of years between liming events. With this information, I need to identify the years from 2017 to 2046 when lime will be applied for each state and perform some additional calculations for those years. That is, the interval between liming events will be different for each state.

I tried the following code to create a column with 1 and 0 identifying the years of liming for each state:

lime_y <- NULL
st <- c(unique(state))
for (i in 1:length(st)) {
  n_load <- mean(nfert)
  lime_int <- round(fert_lim/n_load,digits = 0)   
  lime_yr <- c(seq(t0,tf,by=lime_int))
  flag <- period %in% lime_yr
  lime_yi <- ifelse(flag, 1,0)
  lime_y <- c(lime_y,lime_yi)
}

df <- data.frame(state,county,year,nfert,sand,lime_y)

but it uses the average for all states (in this case, 7). So I tried:

lime_y <- NULL
st <- c(unique(state))
for (i in 1:length(st)) {
  n_load <- mean(df$nfert[st])
  lime_int <- round(fert_lim/n_load,digits = 0)   
  lime_yr <- c(seq(t0,tf,by=lime_int))
  flag <- period %in% lime_yr
  lime_yi <- ifelse(flag, 1,0)
  lime_y <- c(lime_y,lime_yi)
}

df <- data.frame(state,county,year,nfert,sand,lime_y)

To differenciate between states, but I get the following error: Error in seq.default(2017, 2046, by = lime_int) : invalid '(to - from)/by'.

The output should look like this:

    state county year nfert sand lime_y
1      IL      a 2017    11   99      1
2      IL      a 2018     8   84      0
3      IL      a 2019     7   43      0
4      IL      a 2020     7   57      0
5      IL      a 2021    20   15      0
6      IL      a 2022    17   75      0
7      IL      a 2023    13   21      0
8      IL      a 2024     6   87      1
9      IL      a 2025     5   56      0
10     IL      a 2026     9   11      0
11     IL      a 2027    19   50      0
12     IL      a 2028    10    6      0
13     IL      a 2029     3   99      0
14     IL      a 2030    16   10      0
15     IL      a 2031     7    4      1
16     IL      a 2032     6    4      0
17     IL      a 2033    10   52      0
18     IL      a 2034    20   47      0
19     IL      a 2035    17   27      0
20     IL      a 2036    12   54      0
21     IL      a 2037    16   81      0
22     IL      a 2038    10   84      1
23     IL      a 2039    15   68      0
24     IL      a 2040    19   33      0
25     IL      a 2041    14   80      0
26     IL      a 2042    18   48      0
27     IL      a 2043    10   58      0
28     IL      a 2044     3   68      0
29     IL      a 2045    10   62      1
30     IL      a 2046    20   94      0
31     IL      b 2017    13   92      1
32     IL      b 2018    10   85      0
33     IL      b 2019    17   33      0
34     IL      b 2020     2   94      0
35     IL      b 2021     4    9      0
36     IL      b 2022     8   61      0
37     IL      b 2023     3   16      0
38     IL      b 2024    11   59      1
39     IL      b 2025    12   71      0
40     IL      b 2026    11   87      0
41     IL      b 2027     2   83      0
42     IL      b 2028    11    2      0
43     IL      b 2029     3   11      0
44     IL      b 2030     1   22      0
45     IL      b 2031     5   51      1
46     IL      b 2032     9   61      0
47     IL      b 2033    20   63      0
48     IL      b 2034     4   70      0
49     IL      b 2035     1   92      0
50     IL      b 2036    14   73      0
51     IL      b 2037    10   51      0
52     IL      b 2038     6   14      1
53     IL      b 2039     1   58      0
54     IL      b 2040     1   89      0
55     IL      b 2041    18   30      0
56     IL      b 2042    10   34      0
57     IL      b 2043     9   69      0
58     IL      b 2044    20   77      0
59     IL      b 2045     7   16      1
60     IL      b 2046     3   74      0
61     IL      c 2017     4   12      1
62     IL      c 2018     1   89      0
63     IL      c 2019    14   44      0
64     IL      c 2020     4   62      0
65     IL      c 2021     5   62      0
66     IL      c 2022     7   47      0
67     IL      c 2023    18   25      0
68     IL      c 2024    16   85      1
69     IL      c 2025    10   26      0
70     IL      c 2026    15   65      0
71     IL      c 2027     3   33      0
72     IL      c 2028     1   43      0
73     IL      c 2029     3   43      0
74     IL      c 2030     5   37      0
75     IL      c 2031     4    5      1
76     IL      c 2032    14   16      0
77     IL      c 2033     7  100      0
78     IL      c 2034    16   10      0
79     IL      c 2035    15   84      0
80     IL      c 2036    15   25      0
81     IL      c 2037    15   48      0
82     IL      c 2038     3   59      1
83     IL      c 2039    20   73      0
84     IL      c 2040     1   12      0
85     IL      c 2041     2   19      0
86     IL      c 2042     5   85      0
87     IL      c 2043     9   27      0
88     IL      c 2044     9   90      0
89     IL      c 2045     1   49      1
90     IL      c 2046     4   89      0
91     IN      d 2017     8   19      1
92     IN      d 2018    10   74      0
93     IN      d 2019    10   19      0
94     IN      d 2020     8   15      0
95     IN      d 2021     7    4      0
96     IN      d 2022     7   28      0
97     IN      d 2023    12   34      0
98     IN      d 2024    10   22      1
99     IN      d 2025     6   37      0
100    IN      d 2026    12   18      0
101    IN      d 2027    13   27      0
102    IN      d 2028     1   59      0
103    IN      d 2029     8   27      0
104    IN      d 2030     6   65      0
105    IN      d 2031    13   43      1
106    IN      d 2032     7   49      0
107    IN      d 2033    12   45      0
108    IN      d 2034     7   55      0
109    IN      d 2035     7    6      0
110    IN      d 2036    11   29      0
111    IN      d 2037    12   66      0
112    IN      d 2038     2   38      1
113    IN      d 2039     2   27      0
114    IN      d 2040     6   44      0
115    IN      d 2041     5   50      0
116    IN      d 2042     6   61      0
117    IN      d 2043     8   33      0
118    IN      d 2044    10   43      0
119    IN      d 2045     5   31      1
120    IN      d 2046     5   42      0
121    IN      e 2017    13   30      1
122    IN      e 2018    10   58      0
123    IN      e 2019    12   75      0
124    IN      e 2020     7    2      0
125    IN      e 2021     4   29      0
126    IN      e 2022     9   18      0
127    IN      e 2023    13   24      0
128    IN      e 2024     7   74      1
129    IN      e 2025     7   70      0
130    IN      e 2026     2   29      0
131    IN      e 2027     5   64      0
132    IN      e 2028     2   34      0
133    IN      e 2029     3   48      0
134    IN      e 2030     9   37      0
135    IN      e 2031     8   67      1
136    IN      e 2032    10    8      0
137    IN      e 2033     4   56      0
138    IN      e 2034     2   38      0
139    IN      e 2035     2    3      0
140    IN      e 2036     3   35      0
141    IN      e 2037     8   67      0
142    IN      e 2038     5   69      1
143    IN      e 2039     2   44      0
144    IN      e 2040     6   63      0
145    IN      e 2041     7   51      0
146    IN      e 2042     1   16      0
147    IN      e 2043     7   60      0
148    IN      e 2044     6   36      0
149    IN      e 2045    12   17      1
150    IN      e 2046     1   26      0
151    IN      f 2017     4    1      1
152    IN      f 2018     1   71      0
153    IN      f 2019     4   55      0
154    IN      f 2020     5   34      0
155    IN      f 2021     4   69      0
156    IN      f 2022    11   74      0
157    IN      f 2023     1   13      0
158    IN      f 2024     9    8      1
159    IN      f 2025     8   18      0
160    IN      f 2026    13   59      0
161    IN      f 2027     2   14      0
162    IN      f 2028    12   64      0
163    IN      f 2029     1   71      0
164    IN      f 2030     2   59      0
165    IN      f 2031    12    6      1
166    IN      f 2032    10   21      0
 [ reached 'max' / getOption("max.print") -- omitted 14 rows ]

BUT flagged years (1 and 0 in the lime_y column) should be different for each state.

Any help or suggestion would be appreciated.Thanks!

Sean McKenzie · Accepted Answer

Looks like you have a few syntax issues in your two loops.

In your first loop, you are getting an average of all counties because you have nload<-mean(cty), but your indexing variable is i. By the way, you have specified that cty<-unique(county), meaning that cty is a character vector of the six counties. That will simply return

cty
[1] a, b, c, d, e, f

Also, the way you have the code now, each variable is a separate vector, so your indexing variable will select by element number (i.e., when i = 5, the loop will select the fifth element of the vector cty). I think what you want here is to create a dataframe first, and then feed that to your loop

##Your example data
state<-c(rep("IL",90),rep("IN",90))
county <-c(rep("a",30),rep("b",30),rep("c",30),rep("d",30),rep("e",30),rep("f",30))
year <- rep(c(2017:2046), 6)
nfert <- c(sample(20,90,replace = TRUE),sample(13,90,replace = TRUE))
sand <- c(sample(100,90,replace = TRUE),sample(75,90,replace=TRUE))


DF<-data.frame(State=state, County=county, Nfert=nfert, Sand = sand, Period = year) #Using values from your first code block

fertlim<-180

cty<-unique(DF$County)
tmp<-list()

for(i in 1:length(cty)){
  subs<-subset(DF, County==cty[i])
  nload<-mean(subs$Nfert)
  lime_int<-as.integer(fertlim/nload)
  print(lime_int)
  lime_yr <- seq.int(t0,tf,by=lime_int)
  subs[, "Flag"]<-0
  subs$Flag[subs$Period %in% lime_yr]<-1
  tmp[[i]]<-subs
}

OUT<-do.call(rbind, tmp)

The output will look like this

> head(OUT)
  State County Nfert Sand Period Flag
1    IL      a    18   18   2017    1
2    IL      a    18    4   2018    0
3    IL      a    18   49   2019    0
4    IL      a     1   82   2020    0
5    IL      a    15   82   2021    0
6    IL      a     9   19   2022    0

The second loop has a similar problem where again you are attempting to select with cty (which is a character vector), but your indexing variable is i. I think the solution I gave to the first loop should work for you.

Take Care,

-Sean

R - select years within a data frame based on calculated values from another columns

Answers (1)

Related Questions