Matheus Cruz
Matheus Cruz

Reputation: 31

Filter a data frame based on the maximum value of a column

I have a data frame that has 10 columns and "n" rows (a lot of rows).

The idea of the data frame is simple: it gets the market expectation for the exchange rate for the date "t" at a time "d". For example: today we have the market expectation for the exchange rate at january/23, february/23, and so on.... (this data frame brings the expectations on a monthly basis, with daily updates)

This data frame has the historical estimates since january/2019, for dates up to december/2023.

So to summarize, we have a "date_of_estimate" column, and an "estimation_reference" column.

The thing is, I want to filter this huge data frame to get the most updated value for all the monthly estimates since 01-01-2019.

So the code should work just as a maxif function, where it gets the highest value of the "date_of_estimate" column based on the "estimation_reference" value. The "estimation_reference" can be also interpreted as a string, like "Group_A","Group_B", etc...

How do I get the structure im looking for? I'm not very familiar to R and this is an important work routine that has just fallen on my lap...

Thanks in advance

My first guess was to use the aggregate function, the code I used was this one:

`Cambio_PorDataRef = aggregate(base_cambio, by = list(base_cambio$Data), max)`

Where base_cambio is the raw data frame containing all the dates and estimates, base_cambio$data is the "date_of_estimate" column I mentioned above.

The result is: enter image description here

The "data_referencia" column should be composed by unique values, where the "date_of_estimate" (Group 1 column in the image) should be the most updated (latest date available for this estimate) but it is bringing repeated values, and the values dont seem to make sense, as it should begin in 01/2021 and progress month by month until 12/2023 (i.e dec/23).

By running dput(head(base_cambio,20)) I got:

structure(list(Indicador = c("Câmbio", "Câmbio", "Câmbio", "Câmbio", 
"Câmbio", "Câmbio", "Câmbio", "Câmbio", "Câmbio", "Câmbio", "Câmbio", 
"Câmbio", "Câmbio", "Câmbio", "Câmbio", "Câmbio", "Câmbio", "Câmbio", 
"Câmbio", "Câmbio"), Data = structure(c(18655, 18654, 18653, 
18652, 18649, 18648, 18647, 18646, 18645, 18642, 18641, 18640, 
18639, 18638, 18635, 18634, 18633, 18632, 18631, 18683), class = "Date"), 
    DataReferencia = c("01/2021", "01/2021", "01/2021", "01/2021", 
    "01/2021", "01/2021", "01/2021", "01/2021", "01/2021", "01/2021", 
    "01/2021", "01/2021", "01/2021", "01/2021", "01/2021", "01/2021", 
    "01/2021", "01/2021", "01/2021", "02/2021"), Media = c(5.3, 
    5.29, 5.29, 5.29, 5.28, 5.25, 5.25, 5.25, 5.24, 5.24, 5.22, 
    5.21, 5.21, 5.19, 5.17, 5.14, 5.14, 5.13, 5.13, 5.38), Mediana = c(5.3, 
    5.3, 5.3, 5.3, 5.3, 5.25, 5.25, 5.25, 5.25, 5.25, 5.21, 5.2, 
    5.2, 5.16, 5.15, 5.15, 5.15, 5.14, 5.13, 5.4), DesvioPadrao = c(0.11, 
    0.11, 0.11, 0.11, 0.11, 0.1, 0.1, 0.1, 0.1, 0.11, 0.14, 0.14, 
    0.15, 0.15, 0.15, 0.14, 0.13, 0.13, 0.13, 0.07), Minimo = c(4.85, 
    4.85, 4.85, 4.85, 4.85, 4.85, 4.85, 4.85, 4.85, 4.85, 4.85, 
    4.85, 4.85, 4.85, 4.85, 4.85, 4.85, 4.85, 4.85, 5), Maximo = c(5.62, 
    5.62, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.49, 5.49, 
    5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.52), numeroRespondentes = c(102L, 
    102L, 100L, 99L, 99L, 95L, 97L, 97L, 97L, 98L, 92L, 92L, 
    90L, 89L, 90L, 91L, 90L, 90L, 89L, 107L), baseCalculo = c(0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L)), row.names = c(NA, -20L), class = c("tbl_df", 
"tbl", "data.frame"))

The data frame looks like this:

First 6 rows of the raw data

The function is expected to look for the highest (latest) value on the second column (Data) for every unique occurence in the third column (DataReferência). As this data frame only has one unique value, the entire first row should be the ouput of the code im looking for, as it has the highest value on the second column. The code should be able to do the same for every unique value on the third column and gather it all in a new data frame, filtered, with all the columns of the original data frame.

Output should be:

Desired output for the example

Upvotes: 3

Views: 934

Answers (1)

zephryl
zephryl

Reputation: 17284

You want dplyr::slice_max():

library(dplyr)

base_cambio_recent <- base_cambio %>% 
  group_by(DataReferencia) %>% 
  slice_max(Data) %>% 
  ungroup()

Or a base R approach:

base_cambio_recent <- base_cambio[rev(order(base_cambio$Data)), ] 
base_cambio_recent <- lapply(
  split(base_cambio_recent, base_cambio_recent$DataReferencia),
  \(x) head(x, 1)
)
base_cambio_recent <- do.call(rbind, base_cambio_recent)

Result from either approach:

# A tibble: 2 × 10
  Indicador Data       DataReferencia Media Mediana DesvioPadrao Minimo Maximo numeroRespondentes baseCalculo
  <chr>     <date>     <chr>          <dbl>   <dbl>        <dbl>  <dbl>  <dbl>              <int>       <int>
1 Câmbio    2021-01-28 01/2021         5.3      5.3         0.11   4.85   5.62                102           0
2 Câmbio    2021-02-25 02/2021         5.38     5.4         0.07   5      5.52                107           0

Upvotes: 3

Related Questions