Reputation: 31
I have a data frame that has 10 columns and "n" rows (a lot of rows).
The idea of the data frame is simple: it gets the market expectation for the exchange rate for the date "t" at a time "d". For example: today we have the market expectation for the exchange rate at january/23, february/23, and so on.... (this data frame brings the expectations on a monthly basis, with daily updates)
This data frame has the historical estimates since january/2019, for dates up to december/2023.
So to summarize, we have a "date_of_estimate" column, and an "estimation_reference" column.
The thing is, I want to filter this huge data frame to get the most updated value for all the monthly estimates since 01-01-2019.
So the code should work just as a maxif function, where it gets the highest value of the "date_of_estimate" column based on the "estimation_reference" value. The "estimation_reference" can be also interpreted as a string, like "Group_A","Group_B", etc...
How do I get the structure im looking for? I'm not very familiar to R and this is an important work routine that has just fallen on my lap...
Thanks in advance
My first guess was to use the aggregate function, the code I used was this one:
`Cambio_PorDataRef = aggregate(base_cambio, by = list(base_cambio$Data), max)`
Where base_cambio is the raw data frame containing all the dates and estimates, base_cambio$data is the "date_of_estimate" column I mentioned above.
The result is: enter image description here
The "data_referencia" column should be composed by unique values, where the "date_of_estimate" (Group 1 column in the image) should be the most updated (latest date available for this estimate) but it is bringing repeated values, and the values dont seem to make sense, as it should begin in 01/2021 and progress month by month until 12/2023 (i.e dec/23).
By running dput(head(base_cambio,20)) I got:
structure(list(Indicador = c("Câmbio", "Câmbio", "Câmbio", "Câmbio",
"Câmbio", "Câmbio", "Câmbio", "Câmbio", "Câmbio", "Câmbio", "Câmbio",
"Câmbio", "Câmbio", "Câmbio", "Câmbio", "Câmbio", "Câmbio", "Câmbio",
"Câmbio", "Câmbio"), Data = structure(c(18655, 18654, 18653,
18652, 18649, 18648, 18647, 18646, 18645, 18642, 18641, 18640,
18639, 18638, 18635, 18634, 18633, 18632, 18631, 18683), class = "Date"),
DataReferencia = c("01/2021", "01/2021", "01/2021", "01/2021",
"01/2021", "01/2021", "01/2021", "01/2021", "01/2021", "01/2021",
"01/2021", "01/2021", "01/2021", "01/2021", "01/2021", "01/2021",
"01/2021", "01/2021", "01/2021", "02/2021"), Media = c(5.3,
5.29, 5.29, 5.29, 5.28, 5.25, 5.25, 5.25, 5.24, 5.24, 5.22,
5.21, 5.21, 5.19, 5.17, 5.14, 5.14, 5.13, 5.13, 5.38), Mediana = c(5.3,
5.3, 5.3, 5.3, 5.3, 5.25, 5.25, 5.25, 5.25, 5.25, 5.21, 5.2,
5.2, 5.16, 5.15, 5.15, 5.15, 5.14, 5.13, 5.4), DesvioPadrao = c(0.11,
0.11, 0.11, 0.11, 0.11, 0.1, 0.1, 0.1, 0.1, 0.11, 0.14, 0.14,
0.15, 0.15, 0.15, 0.14, 0.13, 0.13, 0.13, 0.07), Minimo = c(4.85,
4.85, 4.85, 4.85, 4.85, 4.85, 4.85, 4.85, 4.85, 4.85, 4.85,
4.85, 4.85, 4.85, 4.85, 4.85, 4.85, 4.85, 4.85, 5), Maximo = c(5.62,
5.62, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.49, 5.49,
5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.52), numeroRespondentes = c(102L,
102L, 100L, 99L, 99L, 95L, 97L, 97L, 97L, 98L, 92L, 92L,
90L, 89L, 90L, 91L, 90L, 90L, 89L, 107L), baseCalculo = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L)), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))
The data frame looks like this:
The function is expected to look for the highest (latest) value on the second column (Data
) for every unique occurence in the third column (DataReferência
). As this data frame only has one unique value, the entire first row should be the ouput of the code im looking for, as it has the highest value on the second column. The code should be able to do the same for every unique value on the third column and gather it all in a new data frame, filtered, with all the columns of the original data frame.
Output should be:
Upvotes: 3
Views: 934
Reputation: 17284
You want dplyr::slice_max()
:
library(dplyr)
base_cambio_recent <- base_cambio %>%
group_by(DataReferencia) %>%
slice_max(Data) %>%
ungroup()
Or a base R approach:
base_cambio_recent <- base_cambio[rev(order(base_cambio$Data)), ]
base_cambio_recent <- lapply(
split(base_cambio_recent, base_cambio_recent$DataReferencia),
\(x) head(x, 1)
)
base_cambio_recent <- do.call(rbind, base_cambio_recent)
Result from either approach:
# A tibble: 2 × 10
Indicador Data DataReferencia Media Mediana DesvioPadrao Minimo Maximo numeroRespondentes baseCalculo
<chr> <date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 Câmbio 2021-01-28 01/2021 5.3 5.3 0.11 4.85 5.62 102 0
2 Câmbio 2021-02-25 02/2021 5.38 5.4 0.07 5 5.52 107 0
Upvotes: 3