Reputation: 1476
I have data files like this:
group Head:
data1: abc    data2: def
2word data3: ghi    data4: jkl
data3: mno    three word data4: pqr stu
So in Python I built a regular expression like this:
Data = re.findall(r'(([\w\(\)]+[ \t\f]?)+):([ \t\f]*(\S+))', data)
My files are nearly 600 lines long, often with two columns as shown above, and parsing them takes several minutes per file.
What would be the best way to make this code more efficient so it can run in less than 10 seconds per file?
Upvotes: 0
Views: 121
Reputation:
This might take less time:
# ([\w()](?:[^\S\r\n]?[\w()]+)*)[^\S\r\n]*:[^\S\r\n]*([\w()](?:[^\S\r\n]?[\w()]+)*)
( # (1) Key
[\w()]
(?: [^\S\r\n]? [\w()]+ )*
)
[^\S\r\n]* : [^\S\r\n]*
( # (2) Value
[\w()]
(?: [^\S\r\n]? [\w()]+ )*
)
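For reference, here is a minimal sketch of using the expanded pattern above from Python with `re.VERBOSE`. The sample text assumes the two columns are separated by several spaces, and the name `PAIR` is only illustrative. It shows what the pattern captures; it does not measure speed.

```python
import re

# The expanded pattern from above, compiled in verbose mode so the
# layout and comments can stay in the source.
PAIR = re.compile(r"""
    (                               # (1) key
        [\w()]
        (?: [^\S\r\n]? [\w()]+ )*
    )
    [^\S\r\n]* : [^\S\r\n]*
    (                               # (2) value
        [\w()]
        (?: [^\S\r\n]? [\w()]+ )*
    )
""", re.VERBOSE)

# Assumed sample: columns separated by four spaces.
sample = (
    "group Head:\n"
    "data1: abc    data2: def\n"
    "2word data3: ghi    data4: jkl\n"
    "data3: mno    three word data4: pqr stu"
)

pairs = PAIR.findall(sample)
for key, value in pairs:
    print(repr(key), repr(value))
```

The `group Head:` line yields no pair here, because the pattern requires a value on the same line as the colon.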
Upvotes: 1
Reputation: 97918
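import re
data = """group Head:
data1: abc    data2: def
2word data3: ghi    data4: jkl
data3: mno    three word data4: pqr stu"""
for l in data.split('\n'):
    # Split each line into columns on runs of two or more whitespace characters,
    # then split each column into [key, value] at the colon.
    print([x.split(':') for x in re.split(r'\s\s+', l) if x])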
Gives:
[['group Head', '']]
[['data1', ' abc'], ['data2', ' def']]
[['2word data3', ' ghi'], ['data4', ' jkl']]
[['data3', ' mno'], ['three word data4', ' pqr stu']]
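The values in that output keep the space that follows the colon. If that matters, one possible follow-up is a small helper that strips both sides. This is only a sketch: `parse_line` is a hypothetical name, and it assumes columns are separated by at least two whitespace characters and that the first colon in a column separates the key from the value.

```python
import re

def parse_line(line):
    """Return a list of (key, value) tuples for one line of the file."""
    pairs = []
    # Columns are assumed to be separated by two or more whitespace characters.
    for cell in re.split(r"\s{2,}", line.strip()):
        if not cell:
            continue
        # Split at the first colon only; the value is empty if nothing follows it.
        key, _, value = cell.partition(":")
        pairs.append((key.strip(), value.strip()))
    return pairs

print(parse_line("data3: mno    three word data4: pqr stu"))
print(parse_line("group Head:"))
```

Because each line is handled with one split and a few string operations, there is no nested repetition for the regex engine to backtrack through.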
Upvotes: 2
Reputation: 48536
You're nesting repetition operators and might be getting exponential backtracking.
Try this instead:
r'(\S.+)\s*:\s*(\S+)'
Non-whitespace followed by anything else, a colon with optional whitespace around it, and some more non-whitespace.
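Two things may be worth checking before relying on this. First, `.+` is greedy, so on a line with two columns the key can run up to the last colon. Second, the backtracking in the original comes from `(…+ …?)+`, where a run of word characters can be split in many ways. The sketch below compares the pattern above with a hypothetical variant (`UNAMBIGUOUS` is my name, not from the question) in which the space between words of a key is mandatory, so each key can be matched in only one way. The sample line assumes columns separated by several spaces.

```python
import re

# Pattern suggested above.
SUGGESTED = re.compile(r'(\S.+)\s*:\s*(\S+)')

# Hypothetical variant: words joined by exactly one space/tab/form feed,
# with no optional separator inside a repeated group.
UNAMBIGUOUS = re.compile(
    r'([\w()]+(?:[ \t\f][\w()]+)*)[ \t\f]*:[ \t\f]*(\S+)'
)

line = "data1: abc    data2: def"

print(SUGGESTED.findall(line))
# [('data1: abc    data2', 'def')]

print(UNAMBIGUOUS.findall(line))
# [('data1', 'abc'), ('data2', 'def')]

# A long line with no colon at all: the variant fails without
# exploring an exponential number of ways to split the words.
long_line = "word " * 400
print(UNAMBIGUOUS.findall(long_line))
# []
```

Like the regex in the question, the variant captures only a single non-whitespace token as the value, so `pqr stu` would come back as `pqr`.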
Upvotes: 2