user3919790
user3919790

Reputation: 557

Rearrange data by matching columns

I am having issue with rearranging some data.

The original data is:

structure(list(id = 1:3, artery.1 = structure(c(1L, 1L, 2L), .Label = c("a", 
"b"), class = "factor"), artery.2 = structure(c(1L, NA, 2L), .Label = c("b", 
"c"), class = "factor"), artery.3 = structure(c(1L, NA, 2L), .Label = c("c", 
"d"), class = "factor"), artery.4 = structure(c(NA, NA, 1L), .Label = "e", class = "factor"), artery.5 = structure(c(NA, NA, 1L), .Label = "f", class = "factor"), 
diameter.1 = c(3L, 2L, 1L), diameter.2 = c(2L, NA, 2L), diameter.3 = c(3L, 
NA, 3L), diameter.4 = c(NA, NA, 4L), diameter.5 = c(NA, NA, 
5L)), .Names = c("id", "artery.1", "artery.2", "artery.3", 
"artery.4", "artery.5", "diameter.1", "diameter.2", "diameter.3", 
"diameter.4", "diameter.5"), class = "data.frame", row.names = c(NA, 
-3L))

#   id artery.1 artery.2 artery.3 artery.4 artery.5 diameter.1 diameter.2 diameter.3 diameter.4 diameter.5
# 1  1        a        b        c     <NA>     <NA>          3          2          3         NA         NA
# 2  2        a     <NA>     <NA>     <NA>     <NA>          2         NA         NA         NA         NA
# 3  3        b        c        d        e        f          1          2          3          4          5

I would like to get to this:

structure(list(id = 1:3, a = c(3L, 2L, NA), b = c(2L, NA, 1L), 
c = c(3L, NA, 2L), d = c(NA, NA, 3L), e = c(NA, NA, 4L), 
f = c(NA, NA, 5L)), .Names = c("id", "a", "b", "c", "d", 
"e", "f"), class = "data.frame", row.names = c(NA, -3L))

#   id  a  b  c  d  e  f
# 1  1  3  2  3 NA NA NA
# 2  2  2 NA NA NA NA NA
# 3  3 NA  1  2  3  4  5

Basically, a to f represents arteries and the numerical values represent the corresponding diameter. Each row represents a patient.

Is there a neat way to sort this dataframe out?

Upvotes: 1

Views: 91

Answers (4)

jdobres
jdobres

Reputation: 11957

Modern tidyr makes the solution even more succinct via the pivot_ functions:

library(dplyr)
library(tidyr)

df %>% 
  pivot_longer(-id, names_pattern = '(artery|diameter)\\.(\\d+)', names_to = c('.value', NA)) %>% 
  filter(!is.na(artery)) %>% 
  pivot_wider(names_from = artery, values_from = diameter)

     id     a     b     c     d     e     f
  <int> <int> <int> <int> <int> <int> <int>
1     1     3     2     3    NA    NA    NA
2     2     2    NA    NA    NA    NA    NA
3     3    NA     1     2     3     4     5

Here is the older solution, which uses the deprecated gather and spread functions:

library(dplyr)
library(tidyr)

new.df <- gather(df, variable, value, artery.1:diameter.5) %>% 
    separate(variable, c('variable', 'num')) %>% 
    spread(variable, value) %>% 
    subset(!is.na(artery)) %>%
    mutate(diameter = as.numeric(diameter)) %>% 
    select(-num) %>% 
    spread(artery, diameter)

Output:

  id  a  b  c  d  e  f
1  1  3  2  3 NA NA NA
2  2  2 NA NA NA NA NA
3  3 NA  1  2  3  4  5

Upvotes: 3

David Arenburg
David Arenburg

Reputation: 92300

Or using melt/dcast combination with data.table while selecting variables using regex in the patterns function

library(data.table) #v>=1.9.6
dcast(melt(setDT(df), 
           id = "id", 
           measure = patterns("artery", "diameter")),
      id ~ value1, 
      sum, 
      value.var = "value2", 
      subset = .(!is.na(value2)), 
      fill = NA)
#    id  a  b  c  d  e  f
# 1:  1  3  2  3 NA NA NA
# 2:  2  2 NA NA NA NA NA
# 3:  3 NA  1  2  3  4  5

As you can see, both melt and dcast are very flexible and you can use regex, specify a subset, pass multiple functions and specify how you want to fill missing values.

Upvotes: 2

bgoldst
bgoldst

Reputation: 35324

This can be done with two reshape() calls. First, we can longify both artery and diameter on id, then widen with artery as the time variable. To prevent a column of NAs, we also must subset out rows with NA values for artery in the intermediate frame.

reshape(subset(reshape(df,dir='l',varying=setdiff(names(df),'id'),timevar=NULL),!is.na(artery)),dir='w',timevar='artery');
##     id diameter.a diameter.b diameter.c diameter.d diameter.e diameter.f
## 1.1  1          3          2          3         NA         NA         NA
## 2.1  2          2         NA         NA         NA         NA         NA
## 3.1  3         NA          1          2          3          4          5

The diameter. prefixes can be removed afterward, if desired. However, an advantage of this solution is that it would be capable of preserving multiple column sets, whereas the xtabs() solution cannot. The prefixes would be essential to distinguish the column sets in that case.

Upvotes: 1

akuiper
akuiper

Reputation: 215117

You can use xtabs with reshape from base R. Use the latter to transform data to long format and use the former to get the count table:

xtabs(diameter ~ id + artery, reshape(df, varying = 2:11, sep = '.', dir = "long"))

#   artery
#id  a b c d e f
#  1 3 2 3 0 0 0
#  2 2 0 0 0 0 0
#  3 0 1 2 3 4 5

Upvotes: 1

Related Questions