Reputation: 161
I have a dataframe df
with ten variables var1, var2, var3,...,var10
to be separated. The values of the variables have this form: value_text. I would like to apply tidyr::separate()
and actually I did ten times, once per variable.
tidyr::separate(col=var1,into=c("value1","text1"),extra="merge")%>%
tidyr::separate(col=var2,into=c("value2","text2"),extra="merge")%>%
...
Do you know any "more elegant" way to use tidyr::separate()
at once (without separating 10 times)?
Upvotes: 1
Views: 582
Reputation: 193687
The best I can think of at the moment is something like this:
library(tidyverse)
mydf %>%
gather(var, val, everything()) %>% # Gather all the columns into a key-value pair
separate(val, into = c("value", "text")) %>% # Separate the gathered columns
gather(key, val, value, text) %>% # Gather again so you have key-key-value
unite(cn, var, key) %>% # Unite your keys to become the column names
group_by(cn) %>% # Keys are duplicated; we need to know their origins
mutate(rn = sequence(n())) %>% # .. so we add row numbers
spread(cn, val) # Then we convert to the wide form
# # A tibble: 5 × 21
# rn var1_text var1_value var10_text var10_value var2_text var2_value var3_text var3_value var4_text var4_value
# * <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1 POIL 235 ZHKV 555 QVWK 479 SKCY 454 YCTY 704
# 2 2 NENB 928 CJLE 956 JXQT 379 HPCV 186 VIQY 764
# 3 3 HFHX 966 PZZE 622 PXHE 261 IUGF 717 YIGE 842
# 4 4 LUWK 021 OJRH 741 XNWE 230 NGEN 486 INYN 003
# 5 5 BISI 637 MEJS 718 TSYO 383 ODGS 755 GBKW 564
# # ... with 10 more variables: var5_text <chr>, var5_value <chr>, var6_text <chr>, var6_value <chr>,
# # var7_text <chr>, var7_value <chr>, var8_text <chr>, var8_value <chr>, var9_text <chr>, var9_value <chr>
# Warning message:
# attributes are not identical across measure variables; they will be dropped
Here's the sample data I used:
library(stringi)
set.seed(1)
mydf <- data.frame(matrix(sprintf("%s_%s", stri_rand_strings(50, 3, "[0-9]"),
stri_rand_strings(50, 4, "[A-Z]")), ncol = 10,
dimnames = list(NULL, paste0("var", 1:10))))
mydf
# var1 var2 var3 var4 var5 var6 var7 var8 var9 var10
# 1 235_POIL 479_QVWK 454_SKCY 704_YCTY 924_JDJQ 883_TYMP 206_BCJE 214_FDZI 944_DFVS 555_ZHKV
# 2 928_NENB 379_JXQT 186_HPCV 764_VIQY 362_KRRO 794_MCGM 877_HEGE 959_NRCD 174_GMCJ 956_CJLE
# 3 966_HFHX 261_PXHE 717_IUGF 842_YIGE 470_LLHP 733_JYNI 448_MUAN 734_BYRC 522_ZQRI 622_PZZE
# 4 021_LUWK 230_XNWE 486_NGEN 003_INYN 838_XDKF 727_HUSE 663_WJBD 107_MMJZ 550_KZWY 741_OJRH
# 5 637_BISI 383_TSYO 755_ODGS 564_GBKW 334_LDLY 121_BTQE 296_IEDF 146_EVBK 069_VUGT 718_MEJS
NOTE: This seems to be entirely the opposite data format of what would be recommended by the authors of the "tidyverse" set of packages.
As mentioned, an alternative would be to use cSplit
from my "splitstackshape" package. With that, the problem is easily resolved as follows:
library(splitstackshape)
cSplit(mydf, names(mydf), "_", type.convert = FALSE)
## var1_1 var1_2 var2_1 var2_2 var3_1 var3_2 var4_1 var4_2 var5_1 var5_2 var6_1 var6_2 var7_1 var7_2 var8_1 var8_2
## 1: 235 POIL 479 QVWK 454 SKCY 704 YCTY 924 JDJQ 883 TYMP 206 BCJE 214 FDZI
## 2: 928 NENB 379 JXQT 186 HPCV 764 VIQY 362 KRRO 794 MCGM 877 HEGE 959 NRCD
## 3: 966 HFHX 261 PXHE 717 IUGF 842 YIGE 470 LLHP 733 JYNI 448 MUAN 734 BYRC
## 4: 021 LUWK 230 XNWE 486 NGEN 003 INYN 838 XDKF 727 HUSE 663 WJBD 107 MMJZ
## 5: 637 BISI 383 TSYO 755 ODGS 564 GBKW 334 LDLY 121 BTQE 296 IEDF 146 EVBK
## var9_1 var9_2 var10_1 var10_2
## 1: 944 DFVS 555 ZHKV
## 2: 174 GMCJ 956 CJLE
## 3: 522 ZQRI 622 PZZE
## 4: 550 KZWY 741 OJRH
## 5: 069 VUGT 718 MEJS
Another option if you want to stick with the tidyverse is to use a for
loop.
for (i in names(mydf)) mydf <- separate_(mydf, i, paste0(i, c("_text", "_value")))
Upvotes: 1