Reputation: 1476

How can I make regex-based parsing of a data file more efficient in Python?

I have data files like this:

group Head:
  data1:        abc         data2:            def
  2word data3:  ghi         data4:            jkl
  data3:        mno         three word data4: pqr stu

So in Python I built a regular expression like this:

Data = re.findall(r'(([\w\(\)]+[ \t\f]?)+):([ \t\f]*(\S+))', data)

My files are about 600 lines long, usually with two columns as shown above, and parsing them takes several minutes per file.

What would be the best way to make this code efficient enough to run in under 10 seconds per file?

Upvotes: 0

Answers (4)

user557597

Reputation:

This might run faster:

 # ([\w()](?:[^\S\r\n]?[\w()]+)*)[^\S\r\n]*:[^\S\r\n]*([\w()](?:[^\S\r\n]?[\w()]+)*)

 (                                 # (1) Key
      [\w()] 
      (?: [^\S\r\n]? [\w()]+ )*
 )
 [^\S\r\n]* : [^\S\r\n]* 
 (                                 # (2) Value
      [\w()] 
      (?: [^\S\r\n]? [\w()]+ )*
 )
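
To try this pattern from Python, one option is to compile it once with re.VERBOSE, so the spaced-out layout above can be kept as written, and to reuse the compiled object for every file. This is a minimal sketch assuming Python 3. The names PAIR, parse_pairs and sample are mine, chosen for illustration, and I have not timed it against the expression in the question:

```python
import re

# The answer's pattern, compiled once in verbose mode (names are illustrative).
PAIR = re.compile(r"""
    ( [\w()] (?: [^\S\r\n]? [\w()]+ )* )    # (1) key
    [^\S\r\n]* : [^\S\r\n]*
    ( [\w()] (?: [^\S\r\n]? [\w()]+ )* )    # (2) value
""", re.VERBOSE)


def parse_pairs(text):
    """Return a list of (key, value) tuples found in text."""
    return [m.groups() for m in PAIR.finditer(text)]


sample = """group Head:
  data1:        abc         data2:            def
  2word data3:  ghi         data4:            jkl
  data3:        mno         three word data4: pqr stu
"""

for key, value in parse_pairs(sample):
    print(repr(key), repr(value))
```

Because the pattern requires a value on the same line as its key, a header line such as "group Head:" yields no pair here and would need separate handling.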

Upvotes: 1

perreal

Reputation: 97918

import re

data = """group Head:
  data1: abc         data2: def
  2word data3: ghi   data4: jkl
  data3: mno         three word data4: pqr stu"""

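for line in data.split('\n'):
    # Split each line on runs of two or more whitespace characters,
    # then split every non-empty field at the colon.
    print([x.split(':') for x in re.split(r'\s\s+', line) if x])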

Gives:

[['group Head', '']]
[['data1', ' abc'], ['data2', ' def']]
[['2word data3', ' ghi'], ['data4', ' jkl']]
[['data3', ' mno'], ['three word data4', ' pqr stu']]
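
Note that the sample in this answer has a single space after each colon. In the question's layout the value is padded with several spaces, so splitting on two or more whitespace characters would also cut between a key and its value. Here is a hedged sketch of one way around that, assuming Python 3, that columns are separated by at least two blanks, and that neither keys nor values contain a colon or a run of two blanks. parse_line is a hypothetical helper name:

```python
import re


def parse_line(line):
    """Split one line into (key, value) pairs; a sketch under the stated assumptions."""
    # Collapse the padding after each colon so it is not mistaken for a column gap.
    line = re.sub(r':[ \t]+', ': ', line.strip())
    pairs = []
    for field in re.split(r'[ \t]{2,}', line):
        if field:
            # Split at the first colon only; a header like "group Head:" gets an empty value.
            key, _, value = field.partition(':')
            pairs.append((key.strip(), value.strip()))
    return pairs


padded = """group Head:
  data1:        abc         data2:            def
  2word data3:  ghi         data4:            jkl
  data3:        mno         three word data4: pqr stu"""

for line in padded.split('\n'):
    print(parse_line(line))
```

Stripping each key and value also removes the leading space that the plain split(':') leaves in front of the values.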

Upvotes: 2