Tiernan
Tiernan

Reputation: 858

Programmatically create new variables using purrr?

Intro

After recently taking Hadley Wickham's functional programming class I decided I'd try applying some of the lessons to my projects at work. Naturally, the first project I tried has proven to be more complicated than the examples worked demonstrated in the class. Does anyone have recommendations for a way to use the purrr package to make the task described below more efficient?

Project Background

I need to assign quintile groups to records in a spatial polygon dataframe. In addition to the record identifier there are several other variables and I need to calculate the quintile group for each.

Here's the crux of the problem: I have been asked to identify outliers in one particular variable and to omit those records from the entire analysis as long as it doesn't change the quintile composition of the first quintile group for any of the other variables.

Question

I have put together a dplyr pipeline (see the example below) that performs this checking process for a single variable, but how might I rewrite this process so that I can efficiently check each variable?

EDIT: While it is certainly possible to change the shape of the data from wide to long as an intermediary step, in the end it needs to return to its wide format so that it matches up with the @polygons slot of the spatial polygons dataframe.

Reproducible Example

You can find the complete script here: https://gist.github.com/tiernanmartin/6cd3e2946a77b7c9daecb51aa11e0c94

Libraries and Settings

library(grDevices)      # boxplot.stats()
library(operator.tools) # %!in% logical operator
library(tmap)           # 'metro' data set
library(magrittr)       # piping
library(dplyr)          # exploratory data analysis verbs
library(purrr)          # recursive mapping of functions 
library(tibble)         # improved version of a data.frame
library(ggplot2)        # dot plot
library(ggrepel)        # avoid label overlap

options(scipen=999)
set.seed(888)

Load the example data and take a small sample of it

data("metro")
    
m_spdf <- metro
# Take a sample
m <-  
        metro@data %>% 
        as_tibble %>% 
        select(-name_long,-iso_a3) %>% 
        sample_n(50)

> m
# A tibble: 50 x 10
          name pop1950 pop1960 pop1970 pop1980 pop1990
         <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1       Sydney 1689935 2134673 2892477 3252111 3631940
2       Havana 1141959 1435511 1779491 1913377 2108381
3     Campinas  151977  293174  540430 1108903 1693359
4         Kano  123073  229203  541992 1349646 2095384
5         Omsk  444326  608363  829860 1032150 1143813
6  Ouagadougou   33035   59126  115374  265200  537441
7    Marseille  755805  928768 1182048 1372495 1418279
8      Taiyuan  196510  349535  621625 1105695 1636599
9       La Paz  319247  437687  600016  809218 1061850
10   Baltimore 1167656 1422067 1554538 1748983 1848834
# ... with 40 more rows, and 4 more variables:
#   pop2000 <dbl>, pop2010 <dbl>, pop2020 <dbl>,
#   pop2030 <dbl>

Calculate quintile groups with and without outlier records

# Calculate the quintile groups for one variable (e.g., `pop1990`)
m_all <- 
        m %>% 
        mutate(qnt_1990_all = dplyr::ntile(pop1990,5))

# Find the outliers for a different variable (e.g., 'pop1950')
# and subset the df to exlcude these outlier records
m_out <- boxplot.stats(m$pop1950) %>% .[["out"]]
        
m_trim <- 
        m %>% 
        filter(pop1950 %!in% m_out) %>% 
        mutate(qnt_1990_trim = dplyr::ntile(pop1990,5))

# Assess whether the outlier trimming impacted the first quintile group
m_comp <- 
        m_trim %>% 
        select(name,dplyr::contains("qnt")) %>% 
        left_join(m_all,.,"name") %>% 
        select(name,dplyr::contains("qnt"),everything()) %>% 
        mutate(qnt_1990_chng_lgl = !is.na(qnt_1990_trim) & qnt_1990_trim != qnt_1990_all,
               qnt_1990_chng_dir = if_else(qnt_1990_chng_lgl,
                                           paste0(qnt_1990_all," to ",qnt_1990_trim),
                                           "No change"))

With a little help from ggplot2, I can see that in this example six outliers were identified and that their omission did not affect the first quintile group for pop1990.

Records impacted by the removal of outliers

Importantly, this information is tracked in two new variables: qnt_1990_chng_lgl and qnt_1990_chng_dir.

> m_comp %>% select(name,qnt_1990_chng_lgl,qnt_1990_chng_dir,everything())
# A tibble: 50 x 14
          name qnt_1990_chng_lgl qnt_1990_chng_dir qnt_1990_all qnt_1990_trim
         <chr>             <lgl>             <chr>        <dbl>         <dbl>
1       Sydney             FALSE         No change            5            NA
2       Havana              TRUE            4 to 5            4             5
3     Campinas              TRUE            3 to 4            3             4
4         Kano             FALSE         No change            4             4
5         Omsk             FALSE         No change            3             3
6  Ouagadougou             FALSE         No change            1             1
7    Marseille             FALSE         No change            3             3
8      Taiyuan              TRUE            3 to 4            3             4
9       La Paz             FALSE         No change            2             2
10   Baltimore             FALSE         No change            4             4
# ... with 40 more rows, and 9 more variables: pop1950 <dbl>, pop1960 <dbl>,
#   pop1970 <dbl>, pop1980 <dbl>, pop1990 <dbl>, pop2000 <dbl>, pop2010 <dbl>,
#   pop2020 <dbl>, pop2030 <dbl>

I now need to find a way to repeat this process for every variable in the dataframe (i.e., pop1960 - pop2030). Ideally, two new variables would be created for each existing pop* variable and their names would be preceded by qnt_ and followed by either _chng_dir or _chng_lgl.

Is purrr the right tool to use for this? dplyr::mutate_? data.table?

Upvotes: 2

Views: 557

Answers (2)

Tiernan
Tiernan

Reputation: 858

It turns out this problem is solvable using tidyr::gather + dplyr::group_by + tidyr::spread functions. While @shayaa and @Gregor didn't provide the solution I was looking for, their advice helped me course-correct away from the functional programming methods I was researching.

I ended up using @shayaa's gather and group_by combination, followed by mutate to create the variable names (qnt_*_chng_lgl and qnt_*_chng_dir) and then using spread to make it wide again. An anonymous function passed to summarize_all removed all the extra NA's that the wide-long-wide transformations created.

m_comp <- 
        m %>% 
        mutate(qnt = dplyr::ntile(pop1950,5)) %>%
        filter(pop1950 %!in% m_out) %>% 
        gather(year,pop,-name,-qnt) %>% 
        group_by(year) %>% 
        mutate(qntTrim = dplyr::ntile(pop,5),
               qnt_chng_lgl = !is.na(qnt) & qnt != qntTrim,
               qnt_chng_dir = ifelse(qnt_chng_lgl,
                                     paste0(qnt," to ",qntTrim),
                                     "No change"),
               year_lgl = paste0("qnt_chng_",year,"_lgl"),
               year_dir = paste0("qnt_chng_",year,"_dir")) %>% 
        spread(year_lgl,qnt_chng_lgl) %>% 
        spread(year_dir,qnt_chng_dir) %>% 
        spread(year,pop) %>% 
        select(-qnt,-qntTrim) %>% 
        group_by(name) %>% 
        summarize_all(function(.){subset(.,!is.na(.)) %>% first}) 

Upvotes: 3

shayaa
shayaa

Reputation: 2797

Nothing wrong with your analysis it seems to me,

After this part

m <-  metro@data %>% 
  as_tibble %>% 
  select(-name_long,-iso_a3) %>% 
  sample_n(50)

Just melt your data and continue your analysis but with group_by(year)

library(reshape2)
library(stringr)
mm <- melt(m)
mm[,2] <- as.factor(str_sub(mm[,2],-4))
names(mm)[2:3] <- c("year", "population")

e.g.,

mm %>% group_by(year) %>%
+   mutate(qnt_all = dplyr::ntile(population,5))

Upvotes: 2

Related Questions