Jaken
Jaken

Reputation: 521

Separating with multiple alternative delimiters in tidyr

I have a column of inconsistently formatted strings that I am trying to split using tidyr::separate_wider_delim(). These either have the pattern "text[space]text", or "text[space]text[comma][space]text" (see example below). In the former, I want to split at the space; in the latter I want to split at the comma/space combination (which is always going to be the second occurrence of a space).

Either of these situations is easy using separate_wider_delim(). However, I can't figure out how to do both at the same time. Doing them in sequence also doesn't work, because the space is consistent between them.

library(tidyr)

df <- data.frame(txt = c("December 2005", "January 2008", "January 3, 2001", "December 18, 2004"))
  
df %>%
  separate_wider_delim(txt, delim = ", ",
                       names = c("month", "year"),
                       too_few = "align_start")

What I'm trying to get to is:

month         year
December      2005
January       2008
January 3     2001
December 18   2004

I can use regex to identify the location of the second space:

str_locate_all(txt, "\\s")[[1]][2, 2]

But I can't figure out how to use this to define the delimiter. I could probably just grab the last four characters pretty easily, but my real data is messier than this, so I would greatly prefer something that relies on the delimiter (if only one space, then the space; if multiple spaces, then the second space/the space following the comma).

Upvotes: 2

Views: 94

Answers (2)

DuesserBaest
DuesserBaest

Reputation: 2829

Staying with delim-based splitting you can use a regex to define the delimeter as:

  • ,?: an optional comma
  • \s+: followed by the space
  • (?= ... ): before
    • \d{4}: the 4-digit number (year)
    • $: that comes before the end of string

See regex demo at regex101


library(tidyr)
library(stringr)

df <- data.frame(
  txt = c(
    "December 2005", 
    "January 2008", 
    "January 3, 2001", 
    "December 18, 2004"
    )
  )

df %>%
  separate_wider_delim(
    txt, 
    delim = stringr::regex(",?\\s+(?=\\d{4}$)"),
    names = c("month", "year"),
    too_few = "align_start"
    )

# # A tibble: 4 × 2
# month       year 
# <chr>       <chr>
# 1 December    2005 
# 2 January     2008 
# 3 January 3   2001 
# 4 December 18 2004 

Upvotes: 4

Darren Tsai
Darren Tsai

Reputation: 35604

You can use separate_wider_regex():

library(tidyr)

df %>%
  separate_wider_regex(txt, c(month = "[^,]+", ",*\\s+", year = ".+"))

# # A tibble: 4 × 2
#   month       year 
#   <chr>       <chr>
# 1 December    2005 
# 2 January     2008 
# 3 January 3   2001 
# 4 December 18 2004

Upvotes: 5

Related Questions