Reputation: 887
I have the following input data set:
INPUT = [
'ABCD , D.O.B: - Jun/14/1999.',
'EFGH , DOB; - Jan/10/1998,',
'IJKL , D-O-B - Jul/15/1985..',
'MNOP , (DOB)* - Dec/21/1999,',
'QRST , *DOB* - Apr/01/2000.',
'UVWX , D O B, - Feb/11/2001 '
]
I would like to convert it into the following output format:
OUTPUT = [
('ABCD, Jun/14/1999'),
('EFGH, Jan/10/1998'),
('IJKL, Jul/15/1985'),
('MNOP, Dec/21/1999'),
('QRST, Apr/1/2000'),
('UVWX, Feb/11/2001')
]
I tried the following code, which partly works, but I cannot get the result into the desired OUTPUT format:
import re
INPUT = [
'ABCD , D.O.B: - Jun/14/1999.',
'EFGH , DOB; - Jan/10/1998,',
'IJKL , D-O-B - Jul/15/1985..',
'MNOP , (DOB)* - Dec/21/1999,',
'QRST , *DOB* - Apr/01/2000.',
'UVWX , D O B, - Feb/11/2001 '
]
def formatted_def(input):
    for n in input:
        t = re.sub('[^a-zA-Z0-9 ]+','',n).split('DOB')
        print(t)
formatted_def(INPUT)
Output -
['ABCD ', ' Jun141999']
['EFGH ', ' Jan101998']
['IJKL ', ' Jul151985']
['MNOP ', ' Dec211999']
['QRST ', ' Apr012000']
['UVWX D O B Feb112001 ']
Any pointers will be very helpful. Thanks in advance!
Upvotes: 1
Views: 109
Reputation: 30971
The main difficulty is producing lines of the form ('ABCD, Jun/14/1999'), exactly.
Such a line cannot be the printed form of a single-element tuple, because that would be printed as ('ABCD, Jun/14/1999',) (note the extra comma before the closing parenthesis).
So to get exactly the result you wanted, I used a series of print statements.
The whole script (in Python 3) can be as follows:
import re
input = [
'ABCD , D.O.B: - Jun/14/1999.',
'EFGH , DOB; - Jan/10/1998,',
'IJKL , D-O-B - Jul/15/1985..',
'MNOP , (DOB)* - Dec/21/1999,',
'QRST , *DOB* - Apr/01/2000.',
'UVWX , D O B, - Feb/11/2001 '
]
result = [ re.sub(r'^([a-z]+).*? - ([a-z]{3}/\d{2}/\d{4}).*',
r'\1, \2', txt, flags = re.IGNORECASE) for txt in input ]
print('OUTPUT = [')
for txt in result:
    print(" ('{}')".format(txt))
print(']')
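The script above prints each entry without the trailing commas shown in the question. If those commas matter, one possible variation is to build the whole text with a join, so that every entry except the last is followed by a comma. This is only a sketch: the names `lines`, `pattern` and `format_output` are made up for illustration, and only two of the sample lines are used.

```python
import re

# Two sample lines copied from the question (hypothetical variable name).
lines = [
    'ABCD , D.O.B: - Jun/14/1999.',
    'UVWX , D O B, - Feb/11/2001 ',
]

# Same expression as in the script above, compiled once.
pattern = re.compile(r'^([a-z]+).*? - ([a-z]{3}/\d{2}/\d{4}).*', re.IGNORECASE)

def format_output(lines):
    """Build the OUTPUT = [...] text with a comma after every entry but the last."""
    entries = [pattern.sub(r'\1, \2', line) for line in lines]
    # Joining with ",\n" puts the comma between entries only.
    body = ",\n".join("('{}')".format(entry) for entry in entries)
    return "OUTPUT = [\n{}\n]".format(body)

print(format_output(lines))
```

Because the commas come from the join separator, no special case is needed for the last line.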
Upvotes: 0
Reputation: 12005
import re
re.findall(r'(\w+)\s+,.*?-\s+([^., ]*)', ' '.join(INPUT))
# [('ABCD', 'Jun/14/1999'), ('EFGH', 'Jan/10/1998'), ('IJKL', 'Jul/15/1985'), ('MNOP', 'Dec/21/1999'), ('QRST', 'Apr/01/2000'), ('UVWX', 'Feb/11/2001')]
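The call above returns a list of (name, date) tuples rather than the 'NAME, date' strings asked for in the question. A minimal sketch of one way to get from one to the other, assuming INPUT is the list from the question; `pairs` and `formatted` are names introduced here for illustration:

```python
import re

INPUT = [
    'ABCD , D.O.B: - Jun/14/1999.',
    'EFGH , DOB; - Jan/10/1998,',
    'IJKL , D-O-B - Jul/15/1985..',
    'MNOP , (DOB)* - Dec/21/1999,',
    'QRST , *DOB* - Apr/01/2000.',
    'UVWX , D O B, - Feb/11/2001 '
]

# Each match yields a (name, date) tuple because the pattern has two groups.
pairs = re.findall(r'(\w+)\s+,.*?-\s+([^., ]*)', ' '.join(INPUT))

# Join every tuple into a single 'NAME, date' string.
formatted = [', '.join(pair) for pair in pairs]
print(formatted)
```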
Upvotes: 2
Reputation: 51894
In addition to the other answer, you can also use re.sub:
import re
INPUT = [
'ABCD , D.O.B: - Jun/14/1999.',
'EFGH , DOB; - Jan/10/1998,',
'IJKL , D-O-B - Jul/15/1985..',
'MNOP , (DOB)* - Dec/21/1999,',
'QRST , *DOB* - Apr/01/2000.',
'UVWX , D O B, - Feb/11/2001 '
]
pattern = r'(?i)^([a-z]+).*([a-z]{3}/\d{2}/\d{4}).*$'
OUTPUT = [re.sub(pattern, r'\1, \2', x) for x in INPUT]
# OUTPUT:
[
'ABCD, Jun/14/1999',
'EFGH, Jan/10/1998',
'IJKL, Jul/15/1985',
'MNOP, Dec/21/1999',
'QRST, Apr/01/2000',
'UVWX, Feb/11/2001'
]
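Note that the desired OUTPUT in the question shows 'Apr/1/2000', without the zero padding, while the result above keeps 'Apr/01/2000'. It is not clear whether that is intentional or a typo. If it is intentional, a second substitution could drop the leading zero from the day. A rough sketch, where `normalize` is a hypothetical helper name:

```python
import re

pattern = r'(?i)^([a-z]+).*([a-z]{3}/\d{2}/\d{4}).*$'

def normalize(line):
    """Return 'NAME, Mon/D/YYYY' with any leading zero removed from the day."""
    cleaned = re.sub(pattern, r'\1, \2', line)
    # '/01/' becomes '/1/'; days such as '/10/' do not match and stay as they are.
    return re.sub(r'/0(\d)/', r'/\1/', cleaned)

print(normalize('QRST , *DOB* - Apr/01/2000.'))  # QRST, Apr/1/2000
print(normalize('EFGH , DOB; - Jan/10/1998,'))   # EFGH, Jan/10/1998
```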
Upvotes: 2
Reputation: 71451
You can use re.findall:
import re
l = ['ABCD , D.O.B: - Jun/14/1999.', 'EFGH , DOB; - Jan/10/1998,', 'IJKL , D-O-B - Jul/15/1985..', 'MNOP , (DOB)* - Dec/21/1999,', 'QRST , *DOB* - Apr/01/2000.', 'UVWX , D O B, - Feb/11/2001 ']
final_data = [', '.join(re.findall(r'^\w+|[a-zA-Z]+/\d+/\d+(?=\W)', i)) for i in l]
Output:
['ABCD, Jun/14/1999', 'EFGH, Jan/10/1998', 'IJKL, Jul/15/1985', 'MNOP, Dec/21/1999', 'QRST, Apr/01/2000', 'UVWX, Feb/11/2001']
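With the findall approach, a line that has no date would silently produce just the name. If the real data may contain such lines, one option is to use re.search and return None when nothing matches. This is a sketch under that assumption; `ENTRY` and `parse_entry` are names invented here, and the pattern is a slightly different one than above:

```python
import re

# Name at the start of the line, then the first Mon/DD/YYYY-like token.
ENTRY = re.compile(r'^(\w+).*?([a-zA-Z]{3}/\d{1,2}/\d{4})')

def parse_entry(line):
    """Return 'NAME, date' for a matching line, or None if no date is found."""
    match = ENTRY.search(line)
    if match is None:
        return None
    return ', '.join(match.groups())

print(parse_entry('UVWX , D O B, - Feb/11/2001 '))
print(parse_entry('no date here'))
```

The caller can then decide whether to skip, log, or reject the lines that come back as None.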
Upvotes: 2