prayner
prayner

Reputation: 415

For each row in a df, extract the observation which has the nearest value

Here is my sample data:

  iso3c production reported_by_FAO
   <chr>      <dbl> <chr>          
 1 NOR    10740117. Yes            
 2 TKM    10726360. No             
 3 SYR    10559247. Yes            
 4 ISR    10317065. No            
 5 DOM    10152261. No            
 6 TJK     9741324. Yes            
 7 YEM     9554599. No             

I want to mutate a new column that shows the iso3c that is reported by FAO (ie reported_by_FAO == "Yes") which has the nearest production value for each row.

Using the sample data created above, this is what the output would look like.

  iso3c production reported_by_FAO nearest_iso3c
   <chr>      <dbl> <chr>          <chr>
 1 NOR    10740117. Yes            NOR
 2 TKM    10726360. No             NOR
 3 SYR    10559247. Yes            SYR
 4 ISR    10317065. No             SYR
 5 DOM    10152261. No             SYR
 6 TJK     9741324. Yes            TJK
 7 YEM     9554599. No             TJK

Upvotes: 1

Views: 48

Answers (1)

akrun
akrun

Reputation: 886998

If the value of 'Yes' have any indication, then get the cumulative sum based on creating a logical vector with 'Yes' and use that as a grouping variable and create the 'nearest' as the first value of 'iso3c'

library(dplyr)
df1 %>%
   group_by(grp = cumsum(reported_by_FAO == 'Yes')) %>% 
   mutate(nearest = first(iso3c)) %>%
   ungroup %>%
   select(-grp)

-output

# A tibble: 7 x 4
#  iso3c production reported_by_FAO nearest
#  <chr>      <dbl> <chr>           <chr>  
#1 NOR     10740117 Yes             NOR    
#2 TKM     10726360 No              NOR    
#3 SYR     10559247 Yes             SYR    
#4 ISR     10317065 No              SYR    
#5 DOM     10152261 No              SYR    
#6 TJK      9741324 Yes             TJK    
#7 YEM      9554599 No              TJK    

Update

Based on the comments, we can extract the 'No', 'Yes' elements separately, use findInterval to get the index and return the corresponding 'iso3c'

df1$nearest <- df1$iso3c
ino <- df1$reported_by_FAO == 'No'
iyes <- df1$reported_by_FAO == 'Yes'
df1$nearest[ino] <- sapply(df1$production[ino], function(x) {
     val <- df1$production[iyes]
     df1$iso3c[iyes][order(val)][findInterval(x, val[order(val)]) + 1]
   })

-output

df1
#  iso3c production reported_by_FAO nearest
#1   NOR   10740117             Yes     NOR
#2   TKM   10726360              No     NOR
#3   SYR   10559247             Yes     SYR
#4   ISR   10317065              No     SYR
#5   DOM   10152261              No     SYR
#6   TJK    9741324             Yes     TJK
#7   YEM    9554599              No     TJK

Or another option is to get the absolute difference and find the index with which.min

df1$nearest[ino] <- sapply(df1$production[ino], function(x) {
     val <- df1$production[iyes]
     df1$iso3c[iyes][which.min(abs(val -x))]
   })

data

df1 <- structure(list(iso3c = c("NOR", "TKM", "SYR", "ISR", "DOM", "TJK", 
"YEM"), production = c(10740117, 10726360, 10559247, 10317065, 
10152261, 9741324, 9554599), reported_by_FAO = c("Yes", "No", 
"Yes", "No", "No", "Yes", "No")), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7"))

Upvotes: 2

Related Questions