1984
1984

Reputation: 41

cSplit Coerces Unnecessary NA Row

I have a large data set a small sample of which looks like the 4 x 5 tibble below. I'm trying to split multiple delimited columns into unique rows using variable c=="Split" as below:

library(splitstackshape)

dt <- tibble(
a = c("Quartz | White Spirit | Wildfire", "Quiet Riot", "Race Against Time", "Down | Heart Lane | X | Breaking H"),
b = c("Muthas Pride", "Killer Girls / Slick Black Cadillac", "Demo 1980", "Life 55"),
c = c("Split", "Single", "Demo", "Split"),
d = c("Birmingham, England | Hartlepool, England | Sheffield, South Yorkshire, England", "Los Angeles, California", "Nottingham, England", "Liverpool | Beijing | | NYC"),
e = c("wf | ef | ff", "g", "f", "cf | af | df | rf")
)

dt.s <- subset(dt, c == "Split")
dt.split <- cSplit(dt.s, c("a", "d", "e"), c("|", "|", "|"), "long") 
dt.split

However, this coerces an extra row of NAs as seen in row 4:

             a            b     c                                   d  e
1:       Quartz Muthas Pride Split                 Birmingham, England wf
2: White Spirit Muthas Pride Split                 Hartlepool, England ef
3:     Wildfire Muthas Pride Split Sheffield, South Yorkshire, England ff
4:           NA Muthas Pride Split                                  NA NA
5:         Down      Life 55 Split                           Liverpool cf
6:   Heart Lane      Life 55 Split                             Beijing af
7:            X      Life 55 Split                                     df
8:   Breaking H      Life 55 Split                                 NYC rf

This is not a problem if I split only two columns. How do I get it to not produce the NA row? And, is there a way to make cSplit work without subsetting by c?

Upvotes: 1

Views: 237

Answers (2)

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193657

Try adding makeEqual = FALSE to your cSplit call:

cSplit(dt.s, c("a", "d", "e"), "|", "long", makeEqual = FALSE)
##               a            b     c                                   d  e
## 1:       Quartz Muthas Pride Split                 Birmingham, England wf
## 2: White Spirit Muthas Pride Split                 Hartlepool, England ef
## 3:     Wildfire Muthas Pride Split Sheffield, South Yorkshire, England ff
## 4:         Down      Life 55 Split                           Liverpool cf
## 5:   Heart Lane      Life 55 Split                             Beijing af
## 6:            X      Life 55 Split                                     df
## 7:   Breaking H      Life 55 Split                                 NYC rf

Also, since you're already using packages from the "tidyverse", you can subset along with splitting, like this:

dt %>% 
  filter(c == "Split") %>% 
  cSplit(c("a", "d", "e"), "|", "long", makeEqual = FALSE)

Upvotes: 0

akrun
akrun

Reputation: 887651

As we are using a tibble, we could use the separate_rows, which is not giving the NA rows

library(tidyr)
separate_rows(dt.s, c('a', "d", "e"), sep="\\s*\\|\\s*") %>%
         select_at(names(dt.s))
# A tibble: 7 x 5
#             a            b     c                                   d     e
#         <chr>        <chr> <chr>                               <chr> <chr>
#1       Quartz Muthas Pride Split                 Birmingham, England    wf
#2 White Spirit Muthas Pride Split                 Hartlepool, England    ef
#3     Wildfire Muthas Pride Split Sheffield, South Yorkshire, England    ff
#4         Down      Life 55 Split                           Liverpool    cf
#5   Heart Lane      Life 55 Split                             Beijing    af
#6            X      Life 55 Split                                        df
#7   Breaking H      Life 55 Split                                 NYC    rf

Regarding why the cSplit is giving an extra row of NA, it would be better to check the output in the 'wide' format

cSplit(dt.s, c("a", "d", "e"), "|")
#              b     c    a_1          a_2      a_3        a_4                 d_1                 d_2                                 d_3 d_4 e_1 e_2 e_3 e_4
#1: Muthas Pride Split Quartz White Spirit Wildfire         NA Birmingham, England Hartlepool, England Sheffield, South Yorkshire, England  NA  wf  ef  ff  NA
#2:      Life 55 Split   Down   Heart Lane        X Breaking H           Liverpool             Beijing                                     NYC  cf  af  df  rf

Here, we find that for the second row, the number of delimiter i.e. | is 4 creating a NA for the first row as there are only 3 delimiters for the 'a column. So, when we use the 'long' format, this NA row propagates. It could be a bug.

Upvotes: 0

Related Questions