Quinma

Reputation: 1476

How can I make regex parsing of a data file in Python more efficient?

I have data files like this:

group Head:
  data1:        abc         data2:            def
  2word data3:  ghi         data4:            jkl
  data3:        mno         three word data4: pqr stu

So in Python I built a regex like this:

Data = re.findall(r'(([\w\(\)]+[ \t\f]?)+):([ \t\f]*(\S+))', data)

My files are around 600 lines long, often with two columns as shown above, and parsing takes several minutes per file.

What would be the best way to make this code more efficient so it can run in less than 10 seconds per file?

Upvotes: 0

Views: 121

Answers (4)

user557597

This might run faster:

 # ([\w()](?:[^\S\r\n]?[\w()]+)*)[^\S\r\n]*:[^\S\r\n]*([\w()](?:[^\S\r\n]?[\w()]+)*)

 (                                 # (1) Key
      [\w()] 
      (?: [^\S\r\n]? [\w()]+ )*
 )
 [^\S\r\n]* : [^\S\r\n]* 
 (                                 # (2) Value
      [\w()] 
      (?: [^\S\r\n]? [\w()]+ )*
 )
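To use the expanded form above from Python, one option is to pass it to re.compile with the re.VERBOSE flag, which ignores whitespace and # comments outside character classes. This is a minimal sketch on the question's sample data; the pattern is the one posted above, while the names KEY_VALUE and sample are illustrative and not part of the original answer.

```python
import re

# The answer's pattern, kept as posted; re.VERBOSE allows the layout and comments.
KEY_VALUE = re.compile(r"""
    (                                  # (1) key
        [\w()]
        (?: [^\S\r\n]? [\w()]+ )*
    )
    [^\S\r\n]* : [^\S\r\n]*            # colon with optional blanks (no newlines)
    (                                  # (2) value
        [\w()]
        (?: [^\S\r\n]? [\w()]+ )*
    )
""", re.VERBOSE)

# Sample data from the question (hypothetical variable name).
sample = """group Head:
  data1:        abc         data2:            def
  2word data3:  ghi         data4:            jkl
  data3:        mno         three word data4: pqr stu"""

# findall returns one (key, value) tuple per match.
pairs = KEY_VALUE.findall(sample)
for key, value in pairs:
    print(repr(key), '->', repr(value))
```

Because the value group accepts words separated by a single blank, the last pair on the sample comes out as ('three word data4', 'pqr stu'), and the header line "group Head:" is skipped since nothing follows its colon on the same line.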

Upvotes: 1

perreal

Reputation: 97918

import re

data = """group Head:
  data1: abc         data2: def
  2word data3: ghi   data4: jkl
  data3: mno         three word data4: pqr stu"""

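# Split each line on runs of two or more whitespace characters, then split each field at the colon.
for line in data.split('\n'):
    print([x.split(':') for x in re.split(r'\s\s+', line) if x])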

Gives:

[['group Head', '']]
[['data1', ' abc'], ['data2', ' def']]
[['2word data3', ' ghi'], ['data4', ' jkl']]
[['data3', ' mno'], ['three word data4', ' pqr stu']]

Upvotes: 2

Eevee

Reputation: 48536

You're nesting repetition operators and might be getting exponential backtracking.

Try this instead:

r'(\S.+)\s*:\s*(\S+)'

Non-whitespace followed by anything else, a colon with optional whitespace around it, and some more non-whitespace.
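To illustrate the point about nested repetition: in the question's key group, ([\w\(\)]+[ \t\f]?)+, the blank is optional, so a run of word characters can be split between iterations in exponentially many ways whenever no colon follows. Making the blank mandatory inside the repeated group removes that ambiguity. On two-column lines a greedy .+ extends to the last colon on the line, so the sketch below uses an explicit word class for keys and values instead. The names WORDS, PAIR and text are hypothetical, and the pattern is one possible variant, not the answerer's.

```python
import re

# Ambiguous shape from the question (shown for comparison only, not run):
#   (([\w\(\)]+[ \t\f]?)+):
# Unambiguous shape: words joined by exactly one mandatory blank, so a run of
# word characters can only be consumed in one way.
WORDS = r'[\w()]+(?:[ \t\f][\w()]+)*'
PAIR = re.compile(r'(' + WORDS + r')[ \t\f]*:[ \t\f]*(' + WORDS + r')')

# Sample data from the question.
text = """group Head:
  data1:        abc         data2:            def
  2word data3:  ghi         data4:            jkl
  data3:        mno         three word data4: pqr stu"""

pairs = PAIR.findall(text)
for key, value in pairs:
    print(repr(key), '->', repr(value))

# A long run of word characters with no colon fails without exponential blow-up.
no_match = PAIR.search('a' * 1000)
```

Keys and values separated by two or more blanks are treated as belonging to different columns, which is an assumption about the file layout taken from the sample.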

Upvotes: 2

andrewgrz

Reputation: 423

Pre-compile your regex (see the Python docs for re.compile).

If possible, split your files and parse line by line.

Both should help reduce your times.
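A rough sketch of both suggestions together: the pattern is compiled once at module level and applied to one line at a time. The pattern itself, the names PAIR_RE and iter_pairs, and the use of io.StringIO as a stand-in for an open file are assumptions for illustration. Note that the re module already caches recently compiled patterns, so the gain from compiling once is usually modest compared with avoiding a backtracking-prone pattern.

```python
import io
import re

# Compiled once and reused for every line (hypothetical pattern: words joined
# by single blanks, a colon with optional blanks, then the value words).
PAIR_RE = re.compile(
    r'([\w()]+(?:[ \t\f][\w()]+)*)'   # key
    r'[ \t\f]*:[ \t\f]*'              # colon with optional blanks
    r'([\w()]+(?:[ \t\f][\w()]+)*)'   # value
)

def iter_pairs(lines):
    """Yield (line_number, key, value) for each pair found, line by line."""
    for number, line in enumerate(lines, start=1):
        for match in PAIR_RE.finditer(line):
            yield number, match.group(1), match.group(2)

# io.StringIO stands in for a file opened with open(path); both iterate by line.
handle = io.StringIO(
    "group Head:\n"
    "  data1:        abc         data2:            def\n"
    "  2word data3:  ghi         data4:            jkl\n"
    "  data3:        mno         three word data4: pqr stu\n"
)
results = list(iter_pairs(handle))
for number, key, value in results:
    print(number, repr(key), repr(value))
```

Working per line also keeps any failed match attempt bounded by the length of a single line rather than the whole file.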

Upvotes: 0
