shrek

Reputation: 887

Regular Expressions and formatting in Python

I have the following input data:

INPUT = [
'ABCD , D.O.B: - Jun/14/1999.',
'EFGH , DOB; - Jan/10/1998,',
'IJKL , D-O-B - Jul/15/1985..',
'MNOP , (DOB)* - Dec/21/1999,',
'QRST , *DOB* - Apr/01/2000.',
'UVWX , D O B, - Feb/11/2001 '
]

I would like to convert it to the following format:

OUTPUT = [
('ABCD, Jun/14/1999'),
('EFGH, Jan/10/1998'),
('IJKL, Jul/15/1985'),
('MNOP, Dec/21/1999'),
('QRST, Apr/1/2000'),
('UVWX, Feb/11/2001')
]

I tried the following code, which partly works, but I cannot get the result into the desired OUTPUT format:

import re

INPUT = [
'ABCD , D.O.B: - Jun/14/1999.',
'EFGH , DOB; - Jan/10/1998,',
'IJKL , D-O-B - Jul/15/1985..',
'MNOP , (DOB)* - Dec/21/1999,',
'QRST , *DOB* - Apr/01/2000.',
'UVWX , D O B, - Feb/11/2001 '
]


def formatted_def(input):
    for n in input:
        t = re.sub('[^a-zA-Z0-9 ]+','',n).split('DOB')
        print(t)


formatted_def(INPUT)

Output:

['ABCD  ', '  Jun141999']
['EFGH  ', '  Jan101998']
['IJKL  ', '  Jul151985']
['MNOP  ', '  Dec211999']
['QRST  ', '  Apr012000']
['UVWX  D O B  Feb112001 ']

Any pointers will be very helpful. Thanks in advance!

Upvotes: 1

Views: 109

Answers (4)

Valdi_Bo

Reputation: 30971

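The main difficulty is producing lines of exactly the form ('ABCD, Jun/14/1999'),.

Such an element cannot be a single-element tuple, because that would be printed as ('ABCD, Jun/14/1999',) (note the extra comma before the closing parenthesis). Without that comma the parentheses only group the expression, so the value is a plain string.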
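A quick check of the tuple-versus-string point, as a minimal sketch (the variable names are only illustrative):

```python
# Parentheses alone only group an expression, so this is a plain str.
single_string = ('ABCD, Jun/14/1999')
# The trailing comma is what makes a one-element tuple.
one_tuple = ('ABCD, Jun/14/1999',)

print(type(single_string).__name__)  # str
print(type(one_tuple).__name__)      # tuple
print(repr(one_tuple))               # ('ABCD, Jun/14/1999',)
```

This is why a list literal written with ('...') elements is stored, and printed, as a list of ordinary strings.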

So, to reproduce the layout you wanted, I build it with a series of print calls.

The whole script (in Python 3) can be as follows:

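import re

# Named INPUT (as in the question) to avoid shadowing the built-in input().
INPUT = [
  'ABCD , D.O.B: - Jun/14/1999.',
  'EFGH , DOB; - Jan/10/1998,',
  'IJKL , D-O-B - Jul/15/1985..',
  'MNOP , (DOB)* - Dec/21/1999,',
  'QRST , *DOB* - Apr/01/2000.',
  'UVWX , D O B, - Feb/11/2001 '
]
# Group 1: the leading name; group 2: the Mon/DD/YYYY date that follows ' - '.
result = [ re.sub(r'^([a-z]+).*? - ([a-z]{3}/\d{2}/\d{4}).*',
                  r'\1, \2', txt, flags=re.IGNORECASE) for txt in INPUT ]
# Print each entry wrapped in ('...') to mimic the requested layout.
print('OUTPUT = [')
for txt in result:
    print(" ('{}')".format(txt))
print(']')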

Upvotes: 0

Sunitha

Reputation: 12005

import re
re.findall(r'(\w+)\s+,.*?-\s+([^., ]*)', ' '.join(INPUT))
# [('ABCD', 'Jun/14/1999'), ('EFGH', 'Jan/10/1998'), ('IJKL', 'Jul/15/1985'), ('MNOP', 'Dec/21/1999'), ('QRST', 'Apr/01/2000'), ('UVWX', 'Feb/11/2001')]
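Because the pattern has two groups, re.findall returns a list of 2-tuples rather than the 'NAME, date' strings shown in the question. A minimal sketch of one way to join them, assuming that string form is what is wanted (the names pairs and OUTPUT are only illustrative):

```python
import re

INPUT = [
    'ABCD , D.O.B: - Jun/14/1999.',
    'EFGH , DOB; - Jan/10/1998,',
    'IJKL , D-O-B - Jul/15/1985..',
    'MNOP , (DOB)* - Dec/21/1999,',
    'QRST , *DOB* - Apr/01/2000.',
    'UVWX , D O B, - Feb/11/2001 ',
]

# Same pattern as above: one (name, date) tuple per record.
pairs = re.findall(r'(\w+)\s+,.*?-\s+([^., ]*)', ' '.join(INPUT))

# Join each tuple into a single 'NAME, date' string.
OUTPUT = [', '.join(pair) for pair in pairs]
print(OUTPUT)
```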

Upvotes: 2

jspcal

Reputation: 51894

In addition to the other answer, you can also use re.sub:

INPUT = [
    'ABCD , D.O.B: - Jun/14/1999.',
    'EFGH , DOB; - Jan/10/1998,',
    'IJKL , D-O-B - Jul/15/1985..',
    'MNOP , (DOB)* - Dec/21/1999,',
    'QRST , *DOB* - Apr/01/2000.',
    'UVWX , D O B, - Feb/11/2001 '
]

pattern = r'(?i)^([a-z]+).*([a-z]{3}/\d{2}/\d{4}).*$'

OUTPUT = [re.sub(pattern, r'\1, \2', x) for x in INPUT]

# OUTPUT:

[
    'ABCD, Jun/14/1999',
    'EFGH, Jan/10/1998',
    'IJKL, Jul/15/1985',
    'MNOP, Dec/21/1999',
    'QRST, Apr/01/2000',
    'UVWX, Feb/11/2001'
]
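One caveat with the re.sub approach: when the pattern does not match, re.sub returns the input string unchanged, so a malformed line would pass through silently. A hedged sketch of a variant built on re.search that makes such lines visible; format_line and PATTERN are hypothetical names, and the second sample line is invented for illustration:

```python
import re

# Name at the start of the line, then the first Mon/DD/YYYY date after it.
PATTERN = re.compile(r'^([a-z]+).*?([a-z]{3}/\d{2}/\d{4})', re.IGNORECASE)

def format_line(line):
    """Return 'NAME, Mon/DD/YYYY', or None if the line has no such date."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return ', '.join(match.groups())

print(format_line('ABCD , D.O.B: - Jun/14/1999.'))  # ABCD, Jun/14/1999
print(format_line('ZZZZ , DOB - unknown'))          # None
```

Returning None is only one possible choice; raising an exception or logging the line would work equally well, depending on how strict the caller needs to be.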

Upvotes: 2

Ajax1234

Reputation: 71451

You can use re.findall:

import re
l = ['ABCD , D.O.B: - Jun/14/1999.', 'EFGH , DOB; - Jan/10/1998,', 'IJKL , D-O-B - Jul/15/1985..', 'MNOP , (DOB)* - Dec/21/1999,', 'QRST , *DOB* - Apr/01/2000.', 'UVWX , D O B, - Feb/11/2001 ']
final_data = [', '.join(re.findall(r'^\w+|[a-zA-Z]+/\d+/\d+(?=\W)', i)) for i in l]

Output:

['ABCD, Jun/14/1999', 'EFGH, Jan/10/1998', 'IJKL, Jul/15/1985', 'MNOP, Dec/21/1999', 'QRST, Apr/01/2000', 'UVWX, Feb/11/2001']
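The desired OUTPUT in the question shows 'Apr/1/2000', while the results here keep the zero-padded 'Apr/01/2000'. If dropping the leading zero is intended (an assumption, since it may simply be a typo in the question), a small post-processing step could look like this; strip_day_zero is a hypothetical helper name:

```python
import re

def strip_day_zero(entry):
    """Turn 'Apr/01/2000' into 'Apr/1/2000'; two-digit days are left alone."""
    return re.sub(r'/0(\d)/', r'/\1/', entry)

print(strip_day_zero('QRST, Apr/01/2000'))  # QRST, Apr/1/2000
print(strip_day_zero('ABCD, Jun/14/1999'))  # ABCD, Jun/14/1999
```

It could be applied to each element of the result list with a list comprehension.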

Upvotes: 2
