Alex
Alex

Reputation: 245

Removing dates ( in any format) form a text column

Hope everyone is well. In my dataset there is column including free texts. My goal is to remove all dates in any format form the text. this is a snapshot of the data

df <- data.frame(
  text=c('tommorow is 2022 11 03',"I married on 2020-01-01",
         'why not going there on 2023/01/14','2023 08 01 will be great'))
df %>% select(text)

                               text
1            tommorow is 2022 11 03
2           I married on 2020-01-01
3 why not going there on 2023/01/14
4          2023 08 01 will be great

The outcome should look like

               text
1            tommorow is 
2            I married on 
3            why not going there on 
4            will be great

Thank you!

Upvotes: 0

Views: 350

Answers (1)

Andy Baxter
Andy Baxter

Reputation: 7646

Best approach would perhaps be to have a sensitive regex pattern:

df <- data.frame(
  text=c('tommorow is 2022 11 03',"I married on 2020-01-01",
         'why not going there on 2023/01/14','2023 08 01 will be great'))

library(tidyverse)

df |>
  mutate(left_text = str_trim(str_remove(text, "\\d{1,4}\\D\\d{1,2}\\D\\d{1,4}")))

#>                                text              left_text
#> 1            tommorow is 2022 11 03            tommorow is
#> 2           I married on 2020-01-01           I married on
#> 3 why not going there on 2023/01/14 why not going there on
#> 4          2023 08 01 will be great          will be great

This will match dates by:

  • \\d{1,4} = starting with either month (1-2 numeric characters), day (1-2 characters) or year (2-4 characters); followed by
  • \\D = anything that's not a number, i.e. the separator; followed by
  • \\d{1,2} = day or month (1-2 chars); followed by
  • \\D again; ending with
  • \\d{1,4} = day or year (1-2 or 2-4 chars)

The challenge is balancing sensitivity with specificity. This should not take out numbers which are clearly not dates, but might miss out:

  • dates with no year
  • dates with no separators
  • dates with double spaces between parts

But hopefully should catch every sensible date in your text column!

Further date detection examples:

library(tidyverse)

df <- data.frame(
  text = c(
    'tommorow is 2022 11 03',
    "I married on 2020-01-01",
    'why not going there on 2023/01/14',
    '2023 08 01 will be great',
    'A trickier example: January 05,2020',
    'or try Oct 2010',
    'dec 21/22 is another date'
  )
)


df |>
  mutate(left_text = str_remove(text, "\\d{1,4}\\D\\d{1,2}\\D\\d{1,4}") |> 
           str_remove(regex(paste0("(", paste(month.name, collapse = "|"),
                                   ")(\\D+\\d{1,2})?\\D+\\d{1,4}"),
                            ignore_case = TRUE)) |> 
           str_remove(regex(paste0("(", paste(month.abb, collapse = "|"),
                                   ")(\\D+\\d{1,2})?\\D+\\d{1,4}"),
                            ignore_case = TRUE)) |> 
           str_trim())

#>                                  text              left_text
#> 1              tommorow is 2022 11 03            tommorow is
#> 2             I married on 2020-01-01           I married on
#> 3   why not going there on 2023/01/14 why not going there on
#> 4            2023 08 01 will be great          will be great
#> 5 A trickier example: January 05,2020    A trickier example:
#> 6                     or try Oct 2010                 or try
#> 7           dec 21/22 is another date        is another date

Final Edit - doing replace with temporary placeholders

The following code should work on a wide range of date formats. It works by replacing in a specific order so as not to accidentally chop out bits of some dates. Gluing together pre-made regex patterns to hopefully give a clearer idea as to what each bit is doing:

library(tidyverse)

df <- data.frame(
  text = c(
    'tommorow is 2022 11 03',
    "I married on 2020-01-01",
    'why not going there on 2023/01/14',
    '2023 08 01 will be great',
    'A trickier example: January 05,2020',
    'or try Oct 26th 2010',
    'dec 21/22 is another date',
    'today is 2023-01-29 & tomorrow is 2022 11 03 & 2022-12-01',
    'A trickier example: January 05,2020',
    '2020-01-01 I married on 2020-12-01',
    'Adding in 1st December 2018',
    'And perhaps Jul 4th 2023'
  )
)



r_year <- "\\d{2,4}"
r_day <- "\\d{1,2}(\\w{1,2})?"  # With or without "st" etc.
r_month_num <- "\\d{1,2}"
r_month_ab <- paste0("(", paste(month.abb, collapse = "|"), ")") 
r_month_full <- paste0("(", paste(month.name, collapse = "|"), ")") 
r_sep <- "[^\\w]+"  # The separators can be anything but letters

library(glue)

df |>
  mutate(
    text = 
      # Any numeric day/month/year
      str_replace_all(text,
                      glue("{r_day}{r_sep}{r_month_num}{r_sep}{r_year}"),
                      "REP_DATE") |> 
      # Any numeric month/day/year
      str_replace_all(glue("{r_month_num}{r_sep}{r_day}{r_sep}{r_year}"),
                      "REP_DATE") |> 
      # Any numeric year/month/day
      str_replace_all(glue("{r_year}{r_sep}{r_month_num}{r_sep}{r_day}"),
                      "REP_DATE") |> 
      # Any day[th]/monthname/year or monthname/day[th]/year
      str_replace_all(regex(paste0(
        glue("({r_day}{r_sep})?({r_month_full}|{r_month_ab})", 
             "{r_sep}({r_day}{r_sep})?{r_year}")
        ), ignore_case = TRUE),
                           "REP_DATE") |> 
      # And transform all placeholders to required date
      str_replace_all("REP_DATE", "25th October 2022")
    )
#>                                                                              text
#> 1                                                   tommorow is 25th October 2022
#> 2                                                  I married on 25th October 2022
#> 3                                        why not going there on 25th October 2022
#> 4                                                 25th October 2022 will be great
#> 5                                           A trickier example: 25th October 2022
#> 6                                                        or try 25th October 2022
#> 7                                               25th October 2022 is another date
#> 8  today is 25th October 2022 & tomorrow is 25th October 2022 & 25th October 2022
#> 9                                           A trickier example: 25th October 2022
#> 10                               25th October 2022 I married on 25th October 2022
#> 11                                                    Adding in 25th October 2022
#> 12                                                  And perhaps 25th October 2022

This should catch all the most common ways of writing dates, even with added "st"s "nd"s and "th"s after day number and irrespective of ordering of parts (apart from any format which puts "year" in the middle between "day" and "month", but that seems unlikely).

Upvotes: 1

Related Questions