Reputation: 399
I am trying to create a list of personnel through the following regex code:
import csv
import re

list_of_electricians = re.findall(r'\d*\.<(\d*)<([\w+ ]*)<"([^"]*)"<"([^"]*)"', csvFile1.read(), re.S)
csvFile2 = open(r'C:\\Users\\Admin\\SkyDrive\\eCommerce\\Servi-fied\\Raw Data\\EMA - Electricians (ReProcessed).csv', 'w+')
writer2 = csv.writer(csvFile2, delimiter=';')
for item in list_of_electricians:
    writer2.writerow(item)
The data that I am trying to extract is in the string as follows:
1.<7059184<ABDUL HALIM M<"ABDUL HALIM M
639 #24-98
ROWELL ROAD
200639"<"62971924(Tel)
93632009(Hp)"
2.<7055147<ABDULLAH SUNNY BIN ALI<"SINGAPORE MRT LTD
251
NORTH BRIDGE ROAD
179102"<"65476617(Tel)
96814905(Hp)"
3.<7063254<ANG CHUI POH<"AKP INDUSTRIES PTE LTD
8B #05-08
ADMIRALTY STREET
757440"<"64811528(Tel)
93890779(Hp)"
Any suggestions as to how I should go about changing the regex code so that all the newlines are ignored? I understand that I could remove all the "\n" or newline characters before running the regex. However, I need those lines later on so that it is easier to process the addresses.
At the end of the day, I am looking at creating a csv file with the data separated into license number, name, address and phone numbers.
Thanks!
Upvotes: 1
Views: 768
Reputation: 33126
That regex is a bit overly complex. This uses a simpler regex and keeps the lines less than 80 characters long (PEP 8):
list_of_electricians = \
    re.findall(r'.*?<(.*?)<(.*?)<"(.*?)"<"(.*?)"', csvFile1.read(), re.S)
The above will still capture the newlines and multiple spaces. One way to get rid of them is to rebuild the list after the fact:
for i, x in enumerate(list_of_electricians):
    list_of_electricians[i] = [' '.join(y.split()) for y in x]
Another way to get rid of them is to use list comprehensions so as to eliminate them from the very start:
list_of_electricians = [
    [' '.join(x.split()) for x in y]
    for y in re.findall(
        r'.*?<(.*?)<(.*?)<"(.*?)"<"(.*?)"', csvFile1.read(), re.S)
]
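For anyone who wants to try the whitespace clean-up without the original file, here is a minimal, self-contained sketch. It assumes the real file looks like the sample posted in the question; the names data, pattern and records are only illustrative.

```python
import re

# Two sample records copied from the question (assumed to match the real file).
data = '''1.<7059184<ABDUL HALIM M<"ABDUL HALIM M
639 #24-98
ROWELL ROAD
200639"<"62971924(Tel)
93632009(Hp)"
2.<7055147<ABDULLAH SUNNY BIN ALI<"SINGAPORE MRT LTD
251
NORTH BRIDGE ROAD
179102"<"65476617(Tel)
96814905(Hp)"
'''

# re.S lets '.' match newlines, so the lazy groups can span several lines.
pattern = re.compile(r'.*?<(.*?)<(.*?)<"(.*?)"<"(.*?)"', re.S)

# str.split() with no arguments splits on any run of whitespace, newlines
# included, so joining the pieces with one space flattens each field.
records = [[' '.join(field.split()) for field in match]
           for match in pattern.findall(data)]

for record in records:
    print(record)
```

Note that this throws the line breaks away, so it only fits if the address lines are not needed separately later on.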
Upvotes: 0
Reputation: 95742
Why not just use csv.reader and avoid the regex altogether?
>>> import csv
>>> from io import StringIO
>>> infile = StringIO(data)  # data holds the text shown in the question
>>> rdr = csv.reader(infile, delimiter="<")
>>> for row in rdr: print(row)
['1.', '7059184', 'ABDUL HALIM M', 'ABDUL HALIM M\n 639 #24-98\n ROWELL ROAD\n 200639', '62971924(Tel)\n 93632009(Hp)']
[]
['2.', '7055147', 'ABDULLAH SUNNY BIN ALI', 'SINGAPORE MRT LTD\n 251\n NORTH BRIDGE ROAD\n 179102', '65476617(Tel)\n 96814905(Hp)']
[]
['3.', '7063254', 'ANG CHUI POH', 'AKP INDUSTRIES PTE LTD\n 8B #05-08\n ADMIRALTY STREET\n 757440', '64811528(Tel)\n 93890779(Hp)']
>>>
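As a rough sketch of how this could be carried through to the semicolon-separated output the question asks for: the snippet below assumes the file looks like the posted sample, and uses in-memory StringIO objects in place of the real input and output files. The names data, rows and out are illustrative.

```python
import csv
import io

# Two sample records copied from the question (assumed to match the real file).
data = '''1.<7059184<ABDUL HALIM M<"ABDUL HALIM M
639 #24-98
ROWELL ROAD
200639"<"62971924(Tel)
93632009(Hp)"
2.<7055147<ABDULLAH SUNNY BIN ALI<"SINGAPORE MRT LTD
251
NORTH BRIDGE ROAD
179102"<"65476617(Tel)
96814905(Hp)"
'''

# The default quotechar is '"', so quoted fields may span several lines.
reader = csv.reader(io.StringIO(data), delimiter='<')

# Skip empty rows and drop the leading sequence number ("1.", "2.", ...).
rows = [row[1:] for row in reader if row]

# Write license number, name, address and phones separated by ';'.
out = io.StringIO()
writer = csv.writer(out, delimiter=';')
writer.writerows(rows)

print(out.getvalue())
```

The newlines inside the address and phone fields are kept, and csv.writer quotes those fields so they can be read back with csv.reader later.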
Upvotes: 0
Reputation: 1570
The code that you have should give you a list of tuples that you can iterate over. The pattern has four capture groups, so each tuple holds the license number, name, address and phone numbers.
That means that your variable list_of_electricians will have something like this:
[('7059184',
  'ABDUL HALIM M',
  'ABDUL HALIM M\n639 #24-98\nROWELL ROAD\n200639',
  '62971924(Tel)\n93632009(Hp)'),
 ('7055147',
  'ABDULLAH SUNNY BIN ALI',
  'SINGAPORE MRT LTD\n251\nNORTH BRIDGE ROAD\n179102',
  '65476617(Tel)\n96814905(Hp)'),
 ...]
which you can iterate over, typically with a for loop.
Hope that helps.
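A small sketch of that iteration, assuming the data matches the sample in the question. The variable names in the loop (license_no, name, address, phones) are just illustrative labels for the four capture groups.

```python
import re

# Two sample records copied from the question (assumed to match the real file).
data = '''1.<7059184<ABDUL HALIM M<"ABDUL HALIM M
639 #24-98
ROWELL ROAD
200639"<"62971924(Tel)
93632009(Hp)"
2.<7055147<ABDULLAH SUNNY BIN ALI<"SINGAPORE MRT LTD
251
NORTH BRIDGE ROAD
179102"<"65476617(Tel)
96814905(Hp)"
'''

# The pattern from the question; [^"]* already matches newlines.
pattern = r'\d*\.<(\d*)<([\w+ ]*)<"([^"]*)"<"([^"]*)"'
list_of_electricians = re.findall(pattern, data)

# Each match is a 4-tuple, so it can be unpacked directly in the for loop.
for license_no, name, address, phones in list_of_electricians:
    address_lines = address.splitlines()   # keep the address line by line
    phone_numbers = phones.splitlines()
    print(license_no, name, address_lines, phone_numbers)
```

Because the newlines are still inside the address field, splitlines() gives the separate address lines for later processing.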
Upvotes: 0
Reputation: 715
Your regular expression is pretty hard for me to parse in my brain, so bear with me. I might even try string splitting on the chosen delimiters in this case, because the regex is pretty complicated.
One tool that's pretty helpful for this sort of thing is http://pythex.org
Anyways, adding [] around the " magically fixes it. Don't ask me why.
\d*\.<(\d*)<([\w+ ]*)<"([^"]*)["]<"([^"]*)"
                              /\
                             here
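For what it's worth, a character class containing only " matches exactly the same thing as a bare ", so here is a quick check that can be run against the data. This is only a sketch and assumes the data matches the sample posted in the question; the names original and bracketed are illustrative.

```python
import re

# Two sample records copied from the question (assumed to match the real file).
data = '''1.<7059184<ABDUL HALIM M<"ABDUL HALIM M
639 #24-98
ROWELL ROAD
200639"<"62971924(Tel)
93632009(Hp)"
2.<7055147<ABDULLAH SUNNY BIN ALI<"SINGAPORE MRT LTD
251
NORTH BRIDGE ROAD
179102"<"65476617(Tel)
96814905(Hp)"
'''

original = r'\d*\.<(\d*)<([\w+ ]*)<"([^"]*)"<"([^"]*)"'
bracketed = r'\d*\.<(\d*)<([\w+ ]*)<"([^"]*)["]<"([^"]*)"'

matches_original = re.findall(original, data)
matches_bracketed = re.findall(bracketed, data)

# Compare the two result lists and report how many records were found.
print(matches_original == matches_bracketed)
print(len(matches_original))
```

If the two patterns give different results on the real file but not on this sample, the difference more likely comes from something in the file itself (for example stray characters or indentation) than from the brackets.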
Upvotes: 1