cs_is_great
cs_is_great

Reputation: 43

R - How to remove dates from a string?

I am cleaning up a table that contains all sorts of weird naming conventions. One of the names I keep seeing is a string of dates. Some of these names contain numbers which are okay but I would like to remove date formats from the strings.

Edit - Dates are either in mm/YY or mm/YYYY format. The dates are normally from 2017 onwards as I have seen (we want more recent updates).

For example:

names <- c('IT Company 09/18',  'Tech Company 9/17', '9/2018 XYZ Company', '50/50 Phone Company')

Should be:

c('IT Company', 'Tech Company', 'XYZ Company', '50/50 Phone Company') 

I tried to use this function here to flag strings with "/" and dates but it also extracts numbers that are not dates:

names2 <- names[grepl("[[:digit:]]", names) & grepl("/", names)]

Output

> names2
[1] "IT Company 09/18"   
[2] "Tech Company 9/17"  
[3] "9/2018 XYZ Company" 
[4] "50/50 Phone Company"

Is there a specific date expression I can use in place of [[:digit:]] to find strings with dates?

Also, what is the function to remove dates including the slash from a string?

Upvotes: 1

Views: 1752

Answers (3)

HarlandMason
HarlandMason

Reputation: 789

Sounds like a job for the stringr packages' string_remove_all function.

The trick is getting the regex pattern right. The fact that you don't have a standard date format you're looking out for and that you'd like to retain 50/50 make life tough.

This worked for me:

library('stringr')

date.pattern <- ' ?(0|1)?[1-9]/([0-9]{4}|[0-9]{2}) ?'

names <- c('IT Company 09/18',  'Tech Company 9/17', '9/2018 XYZ Company', '50/50 Phone Company')
str_remove_all(names, date.pattern)

That regex is supposed to say

  • "A date is one number, or two numbers where the first number is 0 or 1. (Assumes the first number is always a month, which may have a leading zero for single digit months).

  • Then it (lazily) allows for years of length 2 or 4. This could be made more precise if you have a sense of which years you expect to encounter. "0001" probably isn't a year in your dataset.

  • Finally, it will also remove leading or trailing spaces if they exist, since it looks like dates can come before and after the part you care about and are space-delimited from the part you care about.

Output

[1] "IT Company"          "Tech Company"        "XYZ Company"        
[4] "50/50 Phone Company"

It might be smart to also run this through str_extract_all and inspect the outputs to see if they all have the appearance of dates.

Upvotes: 4

Calum You
Calum You

Reputation: 15072

Here is a stringr solution. This is complicated because it is hard to tell whether something is a date just by looking at it: you need to check your outputs and adjust this code if there are more cases. The first regular expression catches formats that are 1/2 digits, /, a 0 or 1, and one more digit, assuming any dates are post year 2000. The second line gets rid of digit, / and then four digits. This nicely excludes 50/50, but what if it was called 12/50 Phone Company? That could be interpreted as Dec 1950 in a certain date format. You'll have to make sure you don't accidentally exclude any similar special cases, there isn't a universal way to tell whether something is meant to be a date or not.

I would read up on regular expressions; this is a good resource to start.

library(stringr)
names <- c('IT Company 09/18',  'Tech Company 9/17', '9/2018 XYZ Company', '50/50 Phone Company')

names %>%
  str_replace("\\d{1,2}/(0|1)\\d", "") %>%
  str_replace("\\d/\\d{4}", "") %>%
  str_trim()
#> [1] "IT Company"          "Tech Company"        "XYZ Company"        
#> [4] "50/50 Phone Company"

Created on 2018-03-14 by the reprex package (v0.2.0).

Upvotes: 1

Rui Barradas
Rui Barradas

Reputation: 76402

I believe the following will do what you want. It uses a regex followed by trimws to trim the white spaces from the beginning and end of the result.

trimws(gsub("[[:digit:]]{1,4}/[[:digit:]]{1,4}", "", names))
#[1] "IT Company"    "Tech Company"  "XYZ Company"   "Phone Company"

Upvotes: 1

Related Questions