Separating with multiple alternative delimiters in tidyr

Question

I have a column of inconsistently formatted strings that I am trying to split using tidyr::separate_wider_delim(). These either have the pattern "text[space]text", or "text[space]text[comma][space]text" (see example below). In the former, I want to split at the space; in the latter I want to split at the comma/space combination (which is always going to be the second occurrence of a space).

Either of these situations is easy using separate_wider_delim(). However, I can't figure out how to do both at the same time. Doing them in sequence also doesn't work, because the space is consistent between them.

library(tidyr)

df <- data.frame(txt = c("December 2005", "January 2008", "January 3, 2001", "December 18, 2004"))
  
df %>%
  separate_wider_delim(txt, delim = ", ",
                       names = c("month", "year"),
                       too_few = "align_start")

What I'm trying to get to is:

month         year
December      2005
January       2008
January 3     2001
December 18   2004

I can use regex to identify the location of the second space:

str_locate_all(txt, "\s")[[1]][2, 2]

But I can't figure out how to use this to define the delimiter. I could probably just grab the last four characters pretty easily, but my real data is messier than this, so I would greatly prefer something that relies on the delimiter (if only one space, then the space; if multiple spaces, then the second space/the space following the comma).

DuesserBaest · Accepted Answer

Staying with delim-based splitting you can use a regex to define the delimeter as:

,?: an optional comma
\s+: followed by the space
(?= ... ): before
- \d{4}: the 4-digit number (year)
- $: that comes before the end of string

See regex demo at regex101


library(tidyr)
library(stringr)

df <- data.frame(
  txt = c(
    "December 2005", 
    "January 2008", 
    "January 3, 2001", 
    "December 18, 2004"
    )
  )

df %>%
  separate_wider_delim(
    txt, 
    delim = stringr::regex(",?\s+(?=\d{4}$)"),
    names = c("month", "year"),
    too_few = "align_start"
    )

# # A tibble: 4 × 2
# month       year 
#        
# 1 December    2005 
# 2 January     2008 
# 3 January 3   2001 
# 4 December 18 2004

Separating with multiple alternative delimiters in tidyr

Answers (2)

Related Questions