Reputation: 43
I am cleaning up a table that contains all sorts of weird naming conventions. One of the names I keep seeing is a string of dates. Some of these names contain numbers which are okay but I would like to remove date formats from the strings.
Edit - Dates are either in mm/YY or mm/YYYY format. The dates are normally from 2017 onwards as I have seen (we want more recent updates).
For example:
names <- c('IT Company 09/18', 'Tech Company 9/17', '9/2018 XYZ Company', '50/50 Phone Company')
Should be:
c('IT Company', 'Tech Company', 'XYZ Company', '50/50 Phone Company')
I tried to use this function here to flag strings with "/" and dates but it also extracts numbers that are not dates:
names2 <- names[grepl("[[:digit:]]", names) & grepl("/", names)]
Output
> names2
[1] "IT Company 09/18"
[2] "Tech Company 9/17"
[3] "9/2018 XYZ Company"
[4] "50/50 Phone Company"
Is there a specific date expression I can use in place of [[:digit:]] to find strings with dates?
Also, what is the function to remove dates including the slash from a string?
Upvotes: 1
Views: 1752
Reputation: 789
Sounds like a job for the stringr
packages' string_remove_all
function.
The trick is getting the regex pattern right. The fact that you don't have a standard date format you're looking out for and that you'd like to retain 50/50 make life tough.
This worked for me:
library('stringr')
date.pattern <- ' ?(0|1)?[1-9]/([0-9]{4}|[0-9]{2}) ?'
names <- c('IT Company 09/18', 'Tech Company 9/17', '9/2018 XYZ Company', '50/50 Phone Company')
str_remove_all(names, date.pattern)
That regex is supposed to say
"A date is one number, or two numbers where the first number is 0 or 1. (Assumes the first number is always a month, which may have a leading zero for single digit months).
Then it (lazily) allows for years of length 2 or 4. This could be made more precise if you have a sense of which years you expect to encounter. "0001" probably isn't a year in your dataset.
Finally, it will also remove leading or trailing spaces if they exist, since it looks like dates can come before and after the part you care about and are space-delimited from the part you care about.
Output
[1] "IT Company" "Tech Company" "XYZ Company"
[4] "50/50 Phone Company"
It might be smart to also run this through str_extract_all
and inspect the outputs to see if they all have the appearance of dates.
Upvotes: 4
Reputation: 15072
Here is a stringr
solution. This is complicated because it is hard to tell whether something is a date just by looking at it: you need to check your outputs and adjust this code if there are more cases. The first regular expression catches formats that are 1/2 digits, /
, a 0
or 1
, and one more digit, assuming any dates are post year 2000. The second line gets rid of digit, /
and then four digits. This nicely excludes 50/50
, but what if it was called 12/50 Phone Company
? That could be interpreted as Dec 1950 in a certain date format. You'll have to make sure you don't accidentally exclude any similar special cases, there isn't a universal way to tell whether something is meant to be a date or not.
I would read up on regular expressions; this is a good resource to start.
library(stringr)
names <- c('IT Company 09/18', 'Tech Company 9/17', '9/2018 XYZ Company', '50/50 Phone Company')
names %>%
str_replace("\\d{1,2}/(0|1)\\d", "") %>%
str_replace("\\d/\\d{4}", "") %>%
str_trim()
#> [1] "IT Company" "Tech Company" "XYZ Company"
#> [4] "50/50 Phone Company"
Created on 2018-03-14 by the reprex package (v0.2.0).
Upvotes: 1
Reputation: 76402
I believe the following will do what you want. It uses a regex followed by trimws
to trim the white spaces from the beginning and end of the result.
trimws(gsub("[[:digit:]]{1,4}/[[:digit:]]{1,4}", "", names))
#[1] "IT Company" "Tech Company" "XYZ Company" "Phone Company"
Upvotes: 1