fixing a messy dataframe with tidyr in R

Question

I have a dataset with observations about households; within each household there are individuals. The number of individuals per household differs. Households are identified with an id and members of the household are identified according to the order they were interviewed. So if household 1 had 4 members, the variable id is the same across all of them, but variable order goes from 1 to 4. The problem I have is that, for some variables, only the first member of the household answered for the rest of the members; therefore I have a mixture of long and wide format within my dataset.

What I need to do is to assign to the correspondent members of the household the values that were answered by the first member of the household. To explain further the structure of my data I´ll give the following toy example:

  df <- data.frame( id = c(rep(1,4), rep(2,5)), order = c(1:4,1:5), 
              age = c(54,20,23,17, 60,57,28,33,19), 
              educDebt1 = c(1, NA, NA, NA, 3, NA, NA, NA, NA), 
              educDebt2 = c(3, NA, NA, NA, 5, NA, NA, NA, NA), 
              educDebt3 = c(NA, NA, NA, NA, 4, NA, NA, NA, NA),
              educDebt1t = c("student loan", NA,NA,NA, 
                         "student loan", NA, NA, NA, NA),
              educDebt2t = c("student fund", NA, NA, NA, 
                         "bank credit", NA, NA, NA, NA),
              educdebt3t = c(NA, NA, NA, NA, 
                         "bank credit", NA, NA, NA, NA),
              educDebt1t_r = c("yes", NA,NA,NA, "no",NA,NA,NA,NA),
              educDebt2t_r = c("no", NA, NA, NA, "no", NA,NA,NA,NA),
              educDebt3t_r = c(NA,NA,NA,NA, "yes", NA,NA,NA,NA),

              bankDebt1 = c(1, NA, NA, NA, 3, NA, NA, NA, NA), 
              bankDebt2 = c(4, NA, NA, NA, 2, NA, NA, NA, NA), 
              bankDebt3 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA),
              bankDebt1t = c("car loan", NA,NA,NA, 
                             "consumer loan", NA, NA, NA, NA),
              bankDebt2t = c("car loan", NA, NA, NA, 
                             "car loan", NA, NA, NA, NA),
              bankdebt3t = c(NA, NA, NA, NA, NA, NA, NA, NA, NA),
              bankDebt1t_r = c("yes", NA,NA,NA, "yes",NA,NA,NA,NA),
              bankDebt2t_r = c("no", NA, NA, NA, "no", NA,NA,NA,NA),
              bankDebt3t_r = c(NA,NA,NA,NA, NA, NA,NA,NA,NA))

I only show some of the columns, for not cluttering the page.

id order age educDebt1 educDebt2 educDebt3   educDebt1t   educDebt2t  educdebt3A
 1     1  54         1         3        NA student loan student fund  NA
 1     2  20        NA        NA        NA           NA         NA    NA
 1     3  23        NA        NA        NA           NA         NA    NA
 1     4  17        NA        NA        NA           NA         NA    NA
 2     1  60         3         5         4 student loan  bank credit  bank credit
 2     2  57        NA        NA        NA           NA         NA    NA
 2     3  28        NA        NA        NA           NA         NA    NA
 2     4  33        NA        NA        NA           NA         NA    NA
 2     5  19        NA        NA        NA           NA         NA    NA

In the toy example from above, I have a household level variable id and individual level variables: order corresponds to the order of the individual in the household; age is their age. The other variables correspond to debts. A household can report at most three debts for each type of debt. In this case there are two types of debt, educational debt educDebt or bank debt bankdebt(only one type is shown above).

So in each household, only the member corresponding to order == 1 answer for the rest of the members in the household. In educDebt1 till educDebt3, the value corresponds to the member of the household with the debt, therefore, if we take a look at the first row, it says that household member 1 of household 1 has an educational debt, as well as household member 3. Then, from educDebt1t to educDebt3t, it tells which type of debt the household member has. In household 2, three are the members with debts, household members: 3, 5 and 4.

Then we have another type of debt, bank debt, and the logic is the same as before.

What I want to accomplish, is to have every member of the household and their debts in a row, something like this:

id order age      educDebt     educDebt_r      bankDebt     bankDebt_r
 1     1  54   student loan        yes         car loan          yes
 1     2  20        NA              NA               NA           NA
 1     3  23   student fund         no               NA           NA
 1     4  17        NA              NA         car loan           no
 2     1  60        NA              NA               NA           NA
 2     2  57        NA              NA         car loan           no
 2     3  28   student loan         no    consumer loan          yes
 2     4  33    bank credit        yes               NA           NA
 2     5  19    bank credit         no               NA           NA

For accomplishing this I actually divided the data in different tables, one with the first three variables, and others for each type of debt. For the debt tables I only kept the row of the interviewed member, and reshape the data to long format so each row became a household member, and then I merged the tables by household and household member id, but there are many debt types, and my aproach is quite inefficient. Is there a way I could achieve the same result with the tidyr package?

My approach was the following:

First, I created three data frames, that extracted different column indexes for each row. I did it with a for loop.

newdf1 <- data.frame()
ind <- c(1,seq(4,19, 3))
for(j in 1:nrow(df)){
    fila <- c()
    for(i in 1:length(ind)){
            dato <- as.character(df[j,ind[i]])
            fila <- c(fila, dato)

    }
    newdf1 <- rbind(newdf1, fila, stringsAsFactors = FALSE )
}

newdf2 <- data.frame()
ind <- c(1,seq(5,20, 3))
for(j in 1:nrow(df)){
    fila <- c()
    for(i in 1:length(ind)){
            dato <- as.character(df[j,ind[i]])
            fila <- c(fila, dato)

    }
    newdf2 <- rbind(newdf2, fila, stringsAsFactors = FALSE )
}

newdf3 <- data.frame()
ind <- c(1,seq(6,21, 3))
for(j in 1:nrow(df)){
    fila <- c()
    for(i in 1:length(ind)){
            dato <- as.character(df[j,ind[i]])
            fila <- c(fila, dato)

    }
    newdf3 <- rbind(newdf3, fila, stringsAsFactors = FALSE )
}

Then I rowbinded them:

NewDfs <- rbind(newdf1,setNames(newdf2, names(newdf1)), 
                            setNames(newdf3, names(newdf1)))

names(NewDfs ) <- c("id", "order", "educDebt", "educDebt_r",
               "order", "bankDebt", "bankDebt_r")

From this dataframe, I extracted the debts regarding education in one dataframe, and the debts regarding bank in another, keep only the compelte cases, and merge them together by id and order.

educ <- NewDfs [,c(1:4)]
bank <- NewDfs [,c(1,5:7)]
educ <- educ[complete.cases(educ), ]
bank <- bank[complete.cases(bank), ]

I also created a datarame with the first three columns of the original dataset.

df_household <- df[,1:3]

And merged it with the educ_bank data frame.

dfMerged <- merge(df_hog, educ_bank, by = c("id", "order"), all.x = TRUE)

 id order age     educDebt educDebt_r      bankDebt bankDebt_r
  1     1  54 student loan        yes      car loan        yes
  1     2  20                                 
  1     3  23 student fund         no                 
  1     4  17                      car loan         no
  2     1  60                                 
  2     2  57                      car loan         no
  2     3  28 student loan         no consumer loan        yes
  2     4  33  bank credit        yes                 
  2     5  19  bank credit         no

Evidently, this doen´t seem to be the most straightforward way of doing it, and I was wondering if there was a simplier way of achieving the same with tidyr.

r2evans · Accepted Answer

I don't have a solution that is completely tidyr (and dplyr), though perhaps somebody more familiar with it can assist. (There is room to include more of the tidyverse, specifically purrr, to replace some of the base R code, but I thought it unnecessary.) I'll walk through each step with the solution at the bottom.

Data

First, I think some of the columns are misnamed (lower-case "debt"), so I fixed it; that's not absolutely critical, but it makes some things much easier. I also disable factors, as some operations (on debt, below) require strings. If having factors is important, I suggest you re-factor after this process.

df <- data.frame(
  id = c(rep(1,4), rep(2,5)), order = c(1:4,1:5), 
  age = c(54,20,23,17, 60,57,28,33,19), 
  educDebt1 = c(1, NA, NA, NA, 3, NA, NA, NA, NA), 
  educDebt2 = c(3, NA, NA, NA, 5, NA, NA, NA, NA), 
  educDebt3 = c(NA, NA, NA, NA, 4, NA, NA, NA, NA),
  educDebt1t = c("student loan", NA,NA,NA, "student loan", NA, NA, NA, NA),
  educDebt2t = c("student fund", NA, NA, NA, "bank credit", NA, NA, NA, NA),
  educDebt3t = c(NA, NA, NA, NA, "bank credit", NA, NA, NA, NA),
  educDebt1t_r = c("yes", NA,NA,NA, "no",NA,NA,NA,NA),
  educDebt2t_r = c("no", NA, NA, NA, "no", NA,NA,NA,NA),
  educDebt3t_r = c(NA,NA,NA,NA, "yes", NA,NA,NA,NA),
  bankDebt1 = c(1, NA, NA, NA, 3, NA, NA, NA, NA), 
  bankDebt2 = c(4, NA, NA, NA, 2, NA, NA, NA, NA), 
  bankDebt3 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA),
  bankDebt1t = c("car loan", NA,NA,NA, "consumer loan", NA, NA, NA, NA),
  bankDebt2t = c("car loan", NA, NA, NA, "car loan", NA, NA, NA, NA),
  bankDebt3t = c(NA, NA, NA, NA, NA, NA, NA, NA, NA),
  bankDebt1t_r = c("yes", NA,NA,NA, "yes",NA,NA,NA,NA),
  bankDebt2t_r = c("no", NA, NA, NA, "no", NA,NA,NA,NA),
  bankDebt3t_r = c(NA,NA,NA,NA, NA, NA,NA,NA,NA),
  stringsAsFactors = FALSE
)
library(dplyr)
library(tidyr)

Step-through

Eventually, we're going to merge in age, and since all respondents are identified by both id and order, so we separate the three:

maintbl <- select(df, id, order, age)

The first thing to realize (for me) is that you need to convert from wide-to-tall, but individually for each three column group. I'll start with the first bunch of three:

grp <- "educDebt"
select(df, id, matches(paste0(grp, "[0-9]+$"))) %>%
  gather(debt, order, -id) %>%
  filter(! is.na(order)) %>%
  arrange(id, order)
#   id      debt order
# 1  1 educDebt1     1
# 2  1 educDebt2     3
# 3  2 educDebt1     3
# 4  2 educDebt3     4
# 5  2 educDebt2     5

(BTW 1: the reason I'm using grp will be apparent later.)
(BTW 2: I used the regex [0-9]+ to match one or more digit, in case this is expanded to include either more than 9 or "arbitrary" numbering. Feel free to omit the +.)

This seems fine. We now need to cbind the *t variant of these three:

select(df, id, matches(paste0(grp, "[0-9]+t$"))) %>%
  gather(debt, type, -id) %>%
  filter(! is.na(type)) %>%
  mutate(debt = gsub("t$", "", debt))
#   id      debt         type
# 1  1 educDebt1 student loan
# 2  2 educDebt1 student loan
# 3  1 educDebt2 student fund
# 4  2 educDebt2  bank credit
# 5  2 educDebt3  bank credit

I changed debt to remove the trailing t, as I'm going to use that as a merging column later. I do the same thing for the third group of three (for "educDebt"), the t_r columns.

These three columns need to be combined, so here I place them in a list and Reduce them:

Reduce(function(x,y) left_join(x, y, by = c("id", "debt")),
       list(
         select(df, id, matches(paste0(grp, "[0-9]+$"))) %>%
           gather(debt, order, -id) %>%
           filter(! is.na(order)) %>%
           arrange(id, order),
         select(df, id, matches(paste0(grp, "[0-9]+t$"))) %>%
           gather(debt, type, -id) %>%
           filter(! is.na(type)) %>%
           mutate(debt = gsub("t$", "", debt)),
         select(df, id, matches(paste0(grp, "[0-9]+t_r$"))) %>%
           gather(debt, r, -id) %>%
           filter(! is.na(r)) %>%
           mutate(debt = gsub("t_r$", "", debt))
       ))
#   id      debt order         type   r
# 1  1 educDebt1     1 student loan yes
# 2  1 educDebt2     3 student fund  no
# 3  2 educDebt1     3 student loan  no
# 4  2 educDebt3     4  bank credit yes
# 5  2 educDebt2     5  bank credit  no

I'll need to rename the last two columns, and since I'm done combining the type and r columns, I can drop debt. (I'd normally suggest dplyr::rename_, but since it is being deprecated shortly, I'm doing it manually. If you have significantly more columns than shown here, you may need to adjust the column numbering, etc.)

Lastly, we need to do this for each of "educDebt" and "bankDebt", join these by id and order (using another Reduce), and finally re-merge in the age.

TL;DR

Reduce(function(x,y) left_join(x, y, by = c("id", "order")),
       lapply(c("educDebt", "bankDebt"), function(grp) {
         ret <- Reduce(function(x,y) left_join(x, y, by = c("id", "debt")),
                       list(
                         select(df, id, matches(paste0(grp, "[0-9]+$"))) %>%
                           gather(debt, order, -id) %>%
                           filter(! is.na(order)) %>%
                           arrange(id, order),
                         select(df, id, matches(paste0(grp, "[0-9]+t$"))) %>%
                           gather(debt, type, -id) %>%
                           filter(! is.na(type)) %>%
                           mutate(debt = gsub("t$", "", debt)),
                         select(df, id, matches(paste0(grp, "[0-9]+t_r$"))) %>%
                           gather(debt, r, -id) %>%
                           filter(! is.na(r)) %>%
                           mutate(debt = gsub("t_r$", "", debt))
                       ))
         names(ret)[4:5] <- c(grp, paste0(grp, "_r"))
         select(ret, -debt)
       })
       ) %>%
  left_join(maintbl, ., by = c("id", "order"))
#   id order age     educDebt educDebt_r      bankDebt bankDebt_r
# 1  1     1  54 student loan        yes      car loan        yes
# 2  1     2  20                                 
# 3  1     3  23 student fund         no                 
# 4  1     4  17                                 
# 5  2     1  60                                 
# 6  2     2  57                                 
# 7  2     3  28 student loan         no consumer loan        yes
# 8  2     4  33  bank credit        yes                 
# 9  2     5  19  bank credit         no

fixing a messy dataframe with tidyr in R

Answers (1)

Data

Step-through

TL;DR

Related Questions