Paul Engelbert

Reputation: 97

Split categorical values of one column into more columns

I am working with data from a survey question that allows multiple answers: a respondent could tick more than one box. As a result, the data set stores all of a respondent's answers concatenated into a single value.

df <- c("VrolijkGemotiveerd", "RelaxtGemotiveerdVrolijk", "Neutraal", "TrotsGezegend", "Neutraal", "Neutraal", "VermoeidGemotiveerd")   

I want to split, for example, RelaxtGemotiveerdVrolijk into Column 1: Relaxt, Column 2: Gemotiveerd, and Column 3: Vrolijk.

Upvotes: 0

Views: 481

Answers (2)

wurli

Reputation: 2748

It looks like you want to split each string wherever an upper-case letter occurs, which can be done with a regular expression. Many functions can apply a regex in this way, e.g. strsplit() or stringr::str_split(), but tidyr has a function specifically for adding new columns using this method:

df <- data.frame(
    c1 = c("VrolijkGemotiveerd", "RelaxtGemotiveerdVrolijk", "Neutraal", 
           "TrotsGezegend", "Neutraal", "Neutraal", "VermoeidGemotiveerd")
)

tidyr::separate(df, c1, into = c("c2", "c3", "c4"), 
                sep = "(?<=.)(?=[[:upper:]])", fill = "right", remove = FALSE)
#>                         c1       c2          c3      c4
#> 1       VrolijkGemotiveerd  Vrolijk Gemotiveerd    <NA>
#> 2 RelaxtGemotiveerdVrolijk   Relaxt Gemotiveerd Vrolijk
#> 3                 Neutraal Neutraal        <NA>    <NA>
#> 4            TrotsGezegend    Trots    Gezegend    <NA>
#> 5                 Neutraal Neutraal        <NA>    <NA>
#> 6                 Neutraal Neutraal        <NA>    <NA>
#> 7      VermoeidGemotiveerd Vermoeid Gemotiveerd    <NA>

EDIT: Updated to use the regular expression from @Laterow's answer, as mine was a bit broken.

Upvotes: 1

slamballais

Reputation: 3235

Answer

Assuming that categories always start with a capital letter, use strsplit with Perl-compatible regular expressions (here df is the character vector from the question):

strsplit(df, "(?<=.)(?=[[:upper:]])", perl = TRUE)

Output:

[[1]]
[1] "Vrolijk"     "Gemotiveerd"

[[2]]
[1] "Relaxt"      "Gemotiveerd" "Vrolijk"    

[[3]]
[1] "Neutraal"

[[4]]
[1] "Trots"    "Gezegend"

[[5]]
[1] "Neutraal"

[[6]]
[1] "Neutraal"

[[7]]
[1] "Vermoeid"    "Gemotiveerd"

Rationale

strsplit lets you split strings by a pattern, and regular expressions let you describe that pattern. Here the pattern looks for a capital letter (i.e. [[:upper:]]). The lookahead (?=...) makes the match zero-width, so the capital letter is kept rather than consumed and the split falls before it rather than after. The lookbehind (?<=.) requires at least one preceding character, so no split happens at the very start of the string.
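As a cross-check on the same assumption (every category is one capital letter followed by lower-case letters), the tokens can also be extracted rather than split. This is a different technique from the strsplit call above, shown only as a sketch; the names answers and tokens are made up for illustration.

```r
# A few of the survey answers from the question
answers <- c("VrolijkGemotiveerd", "RelaxtGemotiveerdVrolijk", "Neutraal")

# One match = one capital letter followed by any number of lower-case letters.
# gregexpr() finds every match; regmatches() pulls the matched text out.
tokens <- regmatches(answers, gregexpr("[[:upper:]][[:lower:]]*", answers))

tokens[[2]]
```

If a category could contain digits, spaces, or internal capitals, this pattern (and the split above) would need adjusting.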

This code returns a list that you can then use for further processing.
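One possible next step, since the question asks for separate columns: pad each list element with NA to a common length and bind the result into a data frame. This is a minimal base R sketch; the names answers, parts, wide and the answer_ column prefix are assumptions for illustration, not part of the original code.

```r
# Survey answers as in the question (a plain character vector)
answers <- c("VrolijkGemotiveerd", "RelaxtGemotiveerdVrolijk", "Neutraal",
             "TrotsGezegend", "Neutraal", "Neutraal", "VermoeidGemotiveerd")

# Split before every capital letter except the first one
parts <- strsplit(answers, "(?<=.)(?=[[:upper:]])", perl = TRUE)

# Pad every element with NA up to the longest length
n_max <- max(lengths(parts))
padded <- lapply(parts, function(p) c(p, rep(NA_character_, n_max - length(p))))

# Bind the padded vectors as rows and name the columns
wide <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(wide) <- paste0("answer_", seq_len(n_max))

wide
```

The number of columns follows from the longest answer in the data, so it does not have to be fixed in advance.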

Upvotes: 0
