Separating multiple variables using tidyr

Question

I have a dataframe df with ten variables var1, var2, var3,...,var10 to be separated. The values of the variables have this form: value_text. I would like to apply tidyr::separate() and actually I did ten times, once per variable.

tidyr::separate(col=var1,into=c("value1","text1"),extra="merge")%>%
tidyr::separate(col=var2,into=c("value2","text2"),extra="merge")%>%

...

Do you know any "more elegant" way to use tidyr::separate() at once (without separating 10 times)?

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer

The best I can think of at the moment is something like this:

library(tidyverse)
mydf %>% 
  gather(var, val, everything()) %>%            # Gather all the columns into a key-value pair 
  separate(val, into = c("value", "text")) %>%  # Separate the gathered columns
  gather(key, val, value, text) %>%             # Gather again so you have key-key-value
  unite(cn, var, key) %>%                       # Unite your keys to become the column names
  group_by(cn) %>%                              # Keys are duplicated; we need to know their origins
  mutate(rn = sequence(n())) %>%                # .. so we add row numbers
  spread(cn, val)                               # Then we convert to the wide form
# # A tibble: 5 × 21
#      rn var1_text var1_value var10_text var10_value var2_text var2_value var3_text var3_value var4_text var4_value
# *                                                          
# 1     1      POIL        235       ZHKV         555      QVWK        479      SKCY        454      YCTY        704
# 2     2      NENB        928       CJLE         956      JXQT        379      HPCV        186      VIQY        764
# 3     3      HFHX        966       PZZE         622      PXHE        261      IUGF        717      YIGE        842
# 4     4      LUWK        021       OJRH         741      XNWE        230      NGEN        486      INYN        003
# 5     5      BISI        637       MEJS         718      TSYO        383      ODGS        755      GBKW        564
# # ... with 10 more variables: var5_text , var5_value , var6_text , var6_value ,
# #   var7_text , var7_value , var8_text , var8_value , var9_text , var9_value 
# Warning message:
# attributes are not identical across measure variables; they will be dropped

Here's the sample data I used:

library(stringi)
set.seed(1)
mydf <- data.frame(matrix(sprintf("%s_%s", stri_rand_strings(50, 3, "[0-9]"), 
                                  stri_rand_strings(50, 4, "[A-Z]")), ncol = 10,
                          dimnames = list(NULL, paste0("var", 1:10))))
mydf
#       var1     var2     var3     var4     var5     var6     var7     var8     var9    var10
# 1 235_POIL 479_QVWK 454_SKCY 704_YCTY 924_JDJQ 883_TYMP 206_BCJE 214_FDZI 944_DFVS 555_ZHKV
# 2 928_NENB 379_JXQT 186_HPCV 764_VIQY 362_KRRO 794_MCGM 877_HEGE 959_NRCD 174_GMCJ 956_CJLE
# 3 966_HFHX 261_PXHE 717_IUGF 842_YIGE 470_LLHP 733_JYNI 448_MUAN 734_BYRC 522_ZQRI 622_PZZE
# 4 021_LUWK 230_XNWE 486_NGEN 003_INYN 838_XDKF 727_HUSE 663_WJBD 107_MMJZ 550_KZWY 741_OJRH
# 5 637_BISI 383_TSYO 755_ODGS 564_GBKW 334_LDLY 121_BTQE 296_IEDF 146_EVBK 069_VUGT 718_MEJS

NOTE: This seems to be entirely the opposite data format of what would be recommended by the authors of the "tidyverse" set of packages.

As mentioned, an alternative would be to use cSplit from my "splitstackshape" package. With that, the problem is easily resolved as follows:

library(splitstackshape)
cSplit(mydf, names(mydf), "_", type.convert = FALSE)
##    var1_1 var1_2 var2_1 var2_2 var3_1 var3_2 var4_1 var4_2 var5_1 var5_2 var6_1 var6_2 var7_1 var7_2 var8_1 var8_2
## 1:    235   POIL    479   QVWK    454   SKCY    704   YCTY    924   JDJQ    883   TYMP    206   BCJE    214   FDZI
## 2:    928   NENB    379   JXQT    186   HPCV    764   VIQY    362   KRRO    794   MCGM    877   HEGE    959   NRCD
## 3:    966   HFHX    261   PXHE    717   IUGF    842   YIGE    470   LLHP    733   JYNI    448   MUAN    734   BYRC
## 4:    021   LUWK    230   XNWE    486   NGEN    003   INYN    838   XDKF    727   HUSE    663   WJBD    107   MMJZ
## 5:    637   BISI    383   TSYO    755   ODGS    564   GBKW    334   LDLY    121   BTQE    296   IEDF    146   EVBK
##    var9_1 var9_2 var10_1 var10_2
## 1:    944   DFVS     555    ZHKV
## 2:    174   GMCJ     956    CJLE
## 3:    522   ZQRI     622    PZZE
## 4:    550   KZWY     741    OJRH
## 5:    069   VUGT     718    MEJS

Another option if you want to stick with the tidyverse is to use a for loop.

for (i in names(mydf)) mydf <- separate_(mydf, i, paste0(i, c("_text", "_value")))

Separating multiple variables using tidyr

Answers (1)

Related Questions