Reputation: 41
I have a large data set a small sample of which looks like the 4 x 5 tibble below. I'm trying to split multiple delimited columns into unique rows using variable c=="Split"
as below:
library(splitstackshape)
dt <- tibble(
a = c("Quartz | White Spirit | Wildfire", "Quiet Riot", "Race Against Time", "Down | Heart Lane | X | Breaking H"),
b = c("Muthas Pride", "Killer Girls / Slick Black Cadillac", "Demo 1980", "Life 55"),
c = c("Split", "Single", "Demo", "Split"),
d = c("Birmingham, England | Hartlepool, England | Sheffield, South Yorkshire, England", "Los Angeles, California", "Nottingham, England", "Liverpool | Beijing | | NYC"),
e = c("wf | ef | ff", "g", "f", "cf | af | df | rf")
)
dt.s <- subset(dt, c == "Split")
dt.split <- cSplit(dt.s, c("a", "d", "e"), c("|", "|", "|"), "long")
dt.split
However, this coerces an extra row of NAs as seen in row 4:
a b c d e
1: Quartz Muthas Pride Split Birmingham, England wf
2: White Spirit Muthas Pride Split Hartlepool, England ef
3: Wildfire Muthas Pride Split Sheffield, South Yorkshire, England ff
4: NA Muthas Pride Split NA NA
5: Down Life 55 Split Liverpool cf
6: Heart Lane Life 55 Split Beijing af
7: X Life 55 Split df
8: Breaking H Life 55 Split NYC rf
This is not a problem if I split only two columns. How do I get it to not produce the NA row? And, is there a way to make cSplit
work without subsetting by c
?
Upvotes: 1
Views: 237
Reputation: 193657
Try adding makeEqual = FALSE
to your cSplit
call:
cSplit(dt.s, c("a", "d", "e"), "|", "long", makeEqual = FALSE)
## a b c d e
## 1: Quartz Muthas Pride Split Birmingham, England wf
## 2: White Spirit Muthas Pride Split Hartlepool, England ef
## 3: Wildfire Muthas Pride Split Sheffield, South Yorkshire, England ff
## 4: Down Life 55 Split Liverpool cf
## 5: Heart Lane Life 55 Split Beijing af
## 6: X Life 55 Split df
## 7: Breaking H Life 55 Split NYC rf
Also, since you're already using packages from the "tidyverse", you can subset along with splitting, like this:
dt %>%
filter(c == "Split") %>%
cSplit(c("a", "d", "e"), "|", "long", makeEqual = FALSE)
Upvotes: 0
Reputation: 887651
As we are using a tibble
, we could use the separate_rows
, which is not giving the NA
rows
library(tidyr)
separate_rows(dt.s, c('a', "d", "e"), sep="\\s*\\|\\s*") %>%
select_at(names(dt.s))
# A tibble: 7 x 5
# a b c d e
# <chr> <chr> <chr> <chr> <chr>
#1 Quartz Muthas Pride Split Birmingham, England wf
#2 White Spirit Muthas Pride Split Hartlepool, England ef
#3 Wildfire Muthas Pride Split Sheffield, South Yorkshire, England ff
#4 Down Life 55 Split Liverpool cf
#5 Heart Lane Life 55 Split Beijing af
#6 X Life 55 Split df
#7 Breaking H Life 55 Split NYC rf
Regarding why the cSplit
is giving an extra row of NA, it would be better to check the output in the 'wide' format
cSplit(dt.s, c("a", "d", "e"), "|")
# b c a_1 a_2 a_3 a_4 d_1 d_2 d_3 d_4 e_1 e_2 e_3 e_4
#1: Muthas Pride Split Quartz White Spirit Wildfire NA Birmingham, England Hartlepool, England Sheffield, South Yorkshire, England NA wf ef ff NA
#2: Life 55 Split Down Heart Lane X Breaking H Liverpool Beijing NYC cf af df rf
Here, we find that for the second row, the number of delimiter i.e. |
is 4 creating a NA
for the first row as there are only 3 delimiters for the 'a column. So, when we use the 'long' format, this NA
row propagates. It could be a bug.
Upvotes: 0