Reputation: 21
I have a bunch of strings of the format "Initials - Month - Year" which I want to split. However, the exact format is not consistent because the strings come from user input. Some examples:
'AA-JAN17'
'AA- JAN 17'
'AA-JAN-17'
'AA - JAN - 17'
'AA - 01 - 2017'
What I want is ['AA', 'JAN', '17']. Converting 01 to JAN or 2017 to 17 is trivial.
I can split on a hyphen and remove spaces by doing
st = 'AA-JAN-17'
parts = [s.strip() for s in st.split('-')]  # named 'parts' to avoid shadowing the built-in list
which will work, except for the first and second examples, where there is no hyphen between month and year. I could probably split both on the hyphen and on the boundary between letters and numbers, but I'm not sure how to do this. This could probably be done with a regex, but I'm not familiar with those at all.
I accept that there are any number of ways the string could be entered but if there's something that can work for all the examples above then that will be good enough for most cases.
Upvotes: 0
Views: 922
Reputation:
This should give you what you're looking for
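string = 'AA - 01 - 2017'
# Drop spaces and hyphens, e.g. 'AA - 01 - 2017' becomes 'AA012017'
string = string.replace(' ', '').replace('-', '')
# Assumes two initials and a two-digit month; the year is the last two characters.
# Slicing avoids replace('20', ''), which would also wipe out a year such as 2020.
date_list = [string[:2], string[2:4], string[-2:]]
print(date_list)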
Upvotes: 1
Reputation: 5855
I would recommend using a regular expression for this. Depending on how structured (or not) your input is, you might have to spend some effort on finding an expression that correctly handles all cases. My suggestion for all cases that have been mentioned so far would be:
r"(?P<initials>\w+)\s*-?\s*(?P<month>\d{1,2}|JAN\w*)\s*-?\s*(?P<year>\d{2,4})"
You can study the effect with any number of online regex evaluators.
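For anyone who wants to try the expression locally, here is a minimal sketch of how it might be applied with Python's re module. The helper name split_fields and the samples list are only for illustration and are not part of the expression itself. Note that, as written, the month part only covers numeric months and names starting with an upper-case "JAN", so other month names would need to be added to the alternation.

```python
import re

# Pattern copied from above; the month alternative only covers numeric
# months and names starting with upper-case "JAN".
pattern = re.compile(
    r"(?P<initials>\w+)\s*-?\s*(?P<month>\d{1,2}|JAN\w*)\s*-?\s*(?P<year>\d{2,4})"
)

samples = ['AA-JAN17', 'AA- JAN 17', 'AA-JAN-17', 'AA - JAN - 17', 'AA - 01 - 2017']

def split_fields(text):
    """Hypothetical helper: return [initials, month, year], or None if there is no match."""
    m = pattern.match(text)
    if m is None:
        return None
    # Named groups can be read by name instead of by position.
    return [m.group('initials'), m.group('month'), m.group('year')]

results = [split_fields(s) for s in samples]
for sample, fields in zip(samples, results):
    print(sample, '->', fields)
```

The four JAN samples come out as ['AA', 'JAN', '17'] and the numeric one as ['AA', '01', '2017'].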
Upvotes: 0
Reputation: 2545
I'd recommend a regex something like this:
import re
samples = ['AA-JAN17',
'AA- JAN 17',
'AA-JAN-17',
'AA - JAN - 17',
'AA - 01 - 2017',
"AA0117"]
input_pat = re.compile(r"([a-z]{2})[- ]*([a-z]{3}|[0-9]{2})[- ]*([0-9]*)", re.I)
for sample in samples:
    print(input_pat.match(sample).groups())
This will have the following output:
('AA', 'JAN', '17')
('AA', 'JAN', '17')
('AA', 'JAN', '17')
('AA', 'JAN', '17')
('AA', '01', '2017')
('AA', '01', '17')
It makes a couple of assumptions (initials will be exactly 2 characters, month will be three letters or two digits), which you could modify.
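Since the input is typed by users, some strings will probably not fit those assumptions. In that case match() returns None, and calling .groups() on it raises an AttributeError. A small guard along these lines might help; parse_entry is a hypothetical helper name, and treating an empty year as a failure is my own assumption about what you would want.

```python
import re

input_pat = re.compile(r"([a-z]{2})[- ]*([a-z]{3}|[0-9]{2})[- ]*([0-9]*)", re.I)

def parse_entry(text):
    """Hypothetical helper: return (initials, month, year), or None if text does not fit."""
    m = input_pat.match(text.strip())
    # match() returns None when the pattern does not apply. The year group
    # uses [0-9]*, so it can be empty; that is treated as a failure here (assumption).
    if m is None or not m.group(3):
        return None
    return m.groups()

print(parse_entry('AA-JAN17'))  # fits the pattern
print(parse_entry('A-JAN17'))   # only one initial, so no match
print(parse_entry('AA-JAN'))    # no year digits
```

Entries that come back as None can then be collected and fixed by hand instead of crashing the loop.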
Upvotes: 1
Reputation: 350300
You could indeed use a regular expression. I would suggest one that matches any series of digits, or any series of letters:
import re
lst = re.findall(r"\d+|[a-z]+", "AA-JAN17", re.I)
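To show how this might fit together with the conversion step the question calls trivial, here is a rough sketch that runs the same findall over all the samples and then maps a numeric month to its abbreviation and trims the year to two digits. The names MONTHS and normalise are made up for this example, and the three-field check is an assumption on my part. A string with no separator between two numbers (such as 'AA0117') yields only two fields with this approach, so it is rejected.

```python
import re

# Month abbreviations indexed by month number (index 0 is unused).
MONTHS = ['', 'JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN',
          'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC']

def normalise(text):
    """Hypothetical helper: split text and return [initials, 'MON', 'YY']."""
    parts = re.findall(r"\d+|[a-z]+", text, re.I)
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {parts!r}")
    initials, month, year = parts
    if month.isdigit():
        number = int(month)
        if not 1 <= number <= 12:
            raise ValueError(f"month out of range: {month!r}")
        month = MONTHS[number]  # '01' -> 'JAN'
    # Upper-case everything, keep three letters of the month and two digits of the year.
    return [initials.upper(), month.upper()[:3], year[-2:]]

samples = ['AA-JAN17', 'AA- JAN 17', 'AA-JAN-17', 'AA - JAN - 17', 'AA - 01 - 2017']
normalised = [normalise(s) for s in samples]
for sample, fields in zip(samples, normalised):
    print(sample, '->', fields)
```

All five samples from the question end up as ['AA', 'JAN', '17'].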
Upvotes: 0