Rushabh Patel
Rushabh Patel

Reputation: 2764

Changing Data types of dataframe columns based on template with matching columns in R

I have 2 dataframes.

I want to change data types of second dataframe based on first. Lets suppose I have below data frame which I am using as template.

> template
id <- c(1,2,3,4)
a <- c(1,4,5,6)
b <- as.character(c(0,1,1,4))
c <- as.character(c(0,1,1,0))
d <- c(0,1,1,0)
template <- data.frame(id,a,b,c,d, stringsAsFactors = FALSE)

> str(template)
'data.frame':   4 obs. of  5 variables:
 $ id: num  1 2 3 4
 $ a : num  1 4 5 6
 $ b : chr  "0" "1" "1" "4"
 $ c : chr  "0" "1" "1" "0"
 $ d : num  0 1 1 0

I am looking for below things.

**Note- It should add additional columns with all NA's if not available in df.

> df
id <- c(6,7,12,14,1,3,4,4)
a <- c(0,1,13,1,3,4,5,6)
b <- c(1,4,12,3,4,5,6,7)
c <- c(0,0,13,3,4,45,6,7)
e <- c(0,0,13,3,4,45,6,7)
df <- data.frame(id,a,b,c,e)

> str(df)
'data.frame':   8 obs. of  5 variables:
 $ id: num  6 7 12 14 1 3 4 4
 $ a : num  0 1 13 1 3 4 5 6
 $ b : num  1 4 12 3 4 5 6 7
 $ c : num  0 0 13 3 4 45 6 7
 $ e : num  0 0 13 3 4 45 6 7

Desired output-

> output
    id  a  b  c  d
    1  6  0  1  0 NA
    2  7  1  4  0 NA
    3 12 13 12 13 NA
    4 14  1  3  3 NA
    5  1  3  4  4 NA
    6  3  4  5 45 NA
    7  4  5  6  6 NA
    8  4  6  7  7 NA

> str(output)

'data.frame':   8 obs. of  5 variables:
 $ id: num  6 7 12 14 1 3 4 4
 $ a : num  0 1 13 1 3 4 5 6
 $ b : chr  "1" "4" "12" "3" ...
 $ c : chr  "0" "0" "13" "3" ...
 $ d : logi  NA NA NA NA NA NA ...

My attempts-

template <- fread("template.csv"),header=TRUE,stringsAsFactors = FALSE)
n <- names(template)
template[,(n) :=  lapply(.SD,function(x) gsub("[^A-Za-z0-90 _/.-]","", as.character(x)))]
n <- names(df)
df[,(n) :=  lapply(.SD,function(x) gsub("[^A-Za-z0-90 _/.-]","", as.character(x)))]
output <- rbindlist(list(template,df),use.names = TRUE,fill = TRUE,idcol="template")

After this I write the output data frame and then reread using write.csv to get data types. But, I am messing up with data types. Please suggest any appropriate way to deal with it.

Upvotes: 5

Views: 630

Answers (2)

Frank
Frank

Reputation: 66819

I'd do

res = data.frame(
  lapply(setNames(,names(template)), function(x) 
    if (x %in% names(df)) as(df[[x]], class(template[[x]])) 
    else template[[x]][NA_integer_]
  ), stringsAsFactors = FALSE)

or with magrittr

library(magrittr)

setNames(, names(template)) %>% 
  lapply(. %>% {
    if (. %in% names(df)) as(df[[.]], class(template[[.]])) 
    else template[[.]][NA_integer_]
  }) %>% data.frame(stringsAsFactors = FALSE)

verifying ...

'data.frame':   8 obs. of  5 variables:
 $ id: num  6 7 12 14 1 3 4 4
 $ a : num  0 1 13 1 3 4 5 6
 $ b : chr  "1" "4" "12" "3" ...
 $ c : chr  "0" "0" "13" "3" ...
 $ d : num  NA NA NA NA NA NA NA NA

I'd suggest looking at the vetr package if you're going to be doing a lot of stuff like this. It has a good approach to templates for data frames and their columns.

Upvotes: 2

Curt F.
Curt F.

Reputation: 4824

Here's some code that does what you want.

require(tidyverse)

new_types <-
    map_df(template, class) %>% 
    t %>%
    as.data.frame(stringsAsFactors = F) %>%
    rownames_to_column %>%
    setNames(c('col', 'type'))

new_data <- df %>%
    gather(col, value) %>%
    right_join(new_types, by='col') %>%
    group_by(col) %>%
    mutate(rownum = row_number()) %>%
    ungroup %>%
    complete(col, rownum=1:max(rownum)) %>%
    group_by(col) %>%
    summarize(val = list(value), type=first(type)) %>%
    mutate(new_val = map2(val, type, ~as(.x, .y, strict = T))) %>%
    select(col, new_val) %>%
    spread(col, new_val) %>%
    unnest

The main idea here is to use map2() from the purrr package to apply the as() function from base R. This function takes in an object (e.g. a vector or column from a dataframe) and a character string that describes a new type, and returns the coerced object. This is the core capability that you need.

My new_types dataframe just lists the column names of the template and the (character-string) named of their type in a data frame.

Except for the map2() line, everything else is mucky data wrangling that could probably be improved.

Some key features:

  • right_join here is essential to keep only the columns you want.
  • the lines from mutate(rownum = row_number()) to complete(col, rownum=1:max(rownum)) are necessary only when the target df has columns that aren't in template -- they make sure that the resulting number of NAs is the same as for the other columns.

Upvotes: 1

Related Questions