Cloud

Reputation: 399

Python (Regex): How do you get Python to ignore all the newlines in between the string pattern you are trying to match?

I am trying to create a list of personnel through the following regex code:

import csv
import re

# csvFile1 is the input file, already opened for reading elsewhere.
list_of_electricians = re.findall(r'\d*\.<(\d*)<([\w+ ]*)<"([^"]*)"<"([^"]*)"', csvFile1.read(), re.S)
# In a raw string the backslashes must not be doubled.
csvFile2 = open(r'C:\Users\Admin\SkyDrive\eCommerce\Servi-fied\Raw Data\EMA - Electricians (ReProcessed).csv', 'w+')
writer2 = csv.writer(csvFile2, delimiter=';')

for item in list_of_electricians:
    writer2.writerow(item)

The data that I am trying to extract is in the string as follows:

1.<7059184<ABDUL HALIM M<"ABDUL HALIM M
                                  639 #24-98
                                 ROWELL ROAD
                        200639"<"62971924(Tel)
                   93632009(Hp)"

2.<7055147<ABDULLAH SUNNY BIN ALI<"SINGAPORE MRT LTD
                                  251
                                 NORTH BRIDGE ROAD
                        179102"<"65476617(Tel)
                   96814905(Hp)"

3.<7063254<ANG CHUI POH<"AKP INDUSTRIES PTE LTD
                                  8B #05-08
                                 ADMIRALTY STREET
                        757440"<"64811528(Tel)
                   93890779(Hp)"

Any suggestions as to how I should go about changing the regex code so that all the newlines are ignored? I understand that I could remove all the "\n" or newline characters before running the regex. However, I need those lines later on so that it is easier to process the addresses.

At the end of the day, I am looking at creating a csv file with the data separated into license number, name, address and phone numbers.

Thanks!

Upvotes: 1

Views: 768

Answers (4)

David Hammen

Reputation: 33126

That regex is a bit overly complex. This uses a simpler regex and keeps the lines less than 80 characters long (PEP 8):

list_of_electricians = re.findall(
    r'.*?<(.*?)<(.*?)<"(.*?)"<"(.*?)"', csvFile1.read(), re.S)

The above will still capture the newlines and multiple spaces. One way to get rid of them is to rebuild the list after the fact:

for i, x in enumerate(list_of_electricians):
    list_of_electricians[i] = [' '.join(y.split()) for y in x]

Another way to get rid of them is to use a list comprehension so that they are eliminated from the very start:

list_of_electricians = [
    [' '.join(x.split()) for x in y]
    for y in
    re.findall(r'.*?<(.*?)<(.*?)<"(.*?)"<"(.*?)"', csvFile1.read(), re.S)]
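As a self-contained sketch of the above, here is the same regex run against two of the sample records from the question. The names SAMPLE, collapse and tidy_lines are made up for this illustration. collapse flattens each field to single spaces as described above; tidy_lines is an alternative that only strips the indentation and keeps the line breaks, in case the address lines are needed later.

```python
import re

# Two records copied from the question, including the original indentation.
SAMPLE = (
    '1.<7059184<ABDUL HALIM M<"ABDUL HALIM M\n'
    '                                  639 #24-98\n'
    '                                 ROWELL ROAD\n'
    '                        200639"<"62971924(Tel)\n'
    '                   93632009(Hp)"\n'
    '\n'
    '2.<7055147<ABDULLAH SUNNY BIN ALI<"SINGAPORE MRT LTD\n'
    '                                  251\n'
    '                                 NORTH BRIDGE ROAD\n'
    '                        179102"<"65476617(Tel)\n'
    '                   96814905(Hp)"\n'
)

# re.S lets '.' match newlines, so a field may span several lines.
PATTERN = re.compile(r'.*?<(.*?)<(.*?)<"(.*?)"<"(.*?)"', re.S)


def collapse(field):
    """Replace every run of whitespace (newlines included) with one space."""
    return ' '.join(field.split())


def tidy_lines(field):
    """Strip the indentation from each line but keep the line breaks."""
    return '\n'.join(line.strip() for line in field.splitlines())


matches = PATTERN.findall(SAMPLE)
flat_rows = [[collapse(f) for f in m] for m in matches]
line_rows = [[tidy_lines(f) for f in m] for m in matches]

for row in flat_rows:
    print(row)
```

With tidy_lines, the address can still be split into its individual lines afterwards with address.splitlines().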

Upvotes: 0

Duncan

Reputation: 95742

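Why not just use csv.reader and avoid the regex altogether? Here data is a string holding the text from the question:

>>> import csv
>>> from io import StringIO
>>> infile = StringIO(data)
>>> rdr = csv.reader(infile, delimiter="<")
>>> for row in rdr: print(row)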

['1.', '7059184', 'ABDUL HALIM M', 'ABDUL HALIM M\n                                  639 #24-98\n                                 ROWELL ROAD\n                        200639', '62971924(Tel)\n                   93632009(Hp)']
[]
['2.', '7055147', 'ABDULLAH SUNNY BIN ALI', 'SINGAPORE MRT LTD\n                                  251\n                                 NORTH BRIDGE ROAD\n                        179102', '65476617(Tel)\n                   96814905(Hp)']
[]
['3.', '7063254', 'ANG CHUI POH', 'AKP INDUSTRIES PTE LTD\n                                  8B #05-08\n                                 ADMIRALTY STREET\n                        757440', '64811528(Tel)\n                   93890779(Hp)']
>>> 
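To go from there to the semicolon-separated file the question asks for, something along these lines might do. It is only a sketch: SAMPLE and reprocess are names invented here, in-memory StringIO objects stand in for the real files, and it assumes every non-empty row has exactly five fields as in the sample.

```python
import csv
import io

# Two records copied from the question, including the original indentation.
SAMPLE = (
    '1.<7059184<ABDUL HALIM M<"ABDUL HALIM M\n'
    '                                  639 #24-98\n'
    '                                 ROWELL ROAD\n'
    '                        200639"<"62971924(Tel)\n'
    '                   93632009(Hp)"\n'
    '\n'
    '2.<7055147<ABDULLAH SUNNY BIN ALI<"SINGAPORE MRT LTD\n'
    '                                  251\n'
    '                                 NORTH BRIDGE ROAD\n'
    '                        179102"<"65476617(Tel)\n'
    '                   96814905(Hp)"\n'
)


def reprocess(infile, outfile):
    """Copy '<'-delimited records to a ';'-delimited csv, dropping the index."""
    reader = csv.reader(infile, delimiter='<')
    writer = csv.writer(outfile, delimiter=';')
    for row in reader:
        if not row:
            continue  # blank separator lines come back as empty lists
        # Assumption: index, license number, name, address, phone numbers.
        _, license_no, name, address, phones = row
        writer.writerow([license_no, name, address, phones])


out = io.StringIO()
reprocess(io.StringIO(SAMPLE), out)
print(out.getvalue())
```

The multi-line address and phone fields are written out quoted, so the line breaks survive for later processing. With real files on Python 3, the csv documentation recommends opening them with newline=''.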

Upvotes: 0

David Lemon

Reputation: 1570

The code that you have should give you a list of tuples that you can iterate over.

That means that your variable list_of_electricians will hold something like this (the newlines and runs of spaces inside the address and phone fields are shortened here):

[('7059184',
'ABDUL HALIM M',
'ABDUL HALIM M 639 #24-98  ROWELL ROAD 200639',
'62971924(Tel) 93632009(Hp)'),
('7055147',
'ABDULLAH SUNNY BIN ALI',
'SINGAPORE MRT LTD    251  NORTH BRIDGE ROAD 179102',
'65476617(Tel) 96814905(Hp)')]

which you can iterate over, typically with a for loop.
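A minimal sketch of such a loop, using the regex from the question on two of the sample records. SAMPLE, records and summary are names made up for this example; the address is split into stripped lines and the phone field into separate numbers.

```python
import re

# Two records copied from the question, including the original indentation.
SAMPLE = (
    '1.<7059184<ABDUL HALIM M<"ABDUL HALIM M\n'
    '                                  639 #24-98\n'
    '                                 ROWELL ROAD\n'
    '                        200639"<"62971924(Tel)\n'
    '                   93632009(Hp)"\n'
    '\n'
    '2.<7055147<ABDULLAH SUNNY BIN ALI<"SINGAPORE MRT LTD\n'
    '                                  251\n'
    '                                 NORTH BRIDGE ROAD\n'
    '                        179102"<"65476617(Tel)\n'
    '                   96814905(Hp)"\n'
)

# The regex from the question: four groups, so findall returns 4-tuples.
pattern = r'\d*\.<(\d*)<([\w+ ]*)<"([^"]*)"<"([^"]*)"'
records = re.findall(pattern, SAMPLE, re.S)

summary = []
for license_no, name, address, phones in records:
    # The address keeps its newlines, so it can be split into clean lines.
    address_lines = [line.strip() for line in address.splitlines()]
    phone_list = phones.split()
    summary.append((license_no, name, address_lines, phone_list))
    print(license_no, name, address_lines, phone_list)
```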

Hope that helps

Upvotes: 0

Devin Howard

Reputation: 715

Your regular expression is pretty hard for me to parse in my head, so bear with me. Because it's so complicated, I might even try plain string splitting on the chosen delimiters in this case.

One tool that's pretty helpful for this sort of thing is http://pythex.org

Anyway, putting [] around the " that closes the address field magically fixes it for me. Don't ask me why.

\d*\.<(\d*)<([\w+ ]*)<"([^"]*)["]<"([^"]*)"
                              /\
                             here
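For the string-splitting idea mentioned at the top, a rough sketch could look like the following. SAMPLE and parse_records are names invented for this example, and it leans on two assumptions taken from the sample data: records are separated by one blank line, and no field ever contains a < character.

```python
# Two records copied from the question, including the original indentation.
SAMPLE = (
    '1.<7059184<ABDUL HALIM M<"ABDUL HALIM M\n'
    '                                  639 #24-98\n'
    '                                 ROWELL ROAD\n'
    '                        200639"<"62971924(Tel)\n'
    '                   93632009(Hp)"\n'
    '\n'
    '2.<7055147<ABDULLAH SUNNY BIN ALI<"SINGAPORE MRT LTD\n'
    '                                  251\n'
    '                                 NORTH BRIDGE ROAD\n'
    '                        179102"<"65476617(Tel)\n'
    '                   96814905(Hp)"\n'
)


def parse_records(text):
    """Split the text into (license, name, address, phones) tuples."""
    records = []
    # Assumption: a blank line separates one record from the next.
    for block in text.strip().split('\n\n'):
        parts = block.split('<')
        if len(parts) != 5:
            continue  # skip anything that does not look like a record
        _, license_no, name, address, phones = parts
        # Drop the surrounding double quotes; inner newlines are kept.
        records.append(
            (license_no, name, address.strip('"'), phones.strip('"')))
    return records


parsed = parse_records(SAMPLE)
for record in parsed:
    print(record)
```

The newlines inside the address are left alone, so it can still be taken apart line by line later.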

Upvotes: 1
