윤제균

Reputation: 159

Python regular expression to split parameterized text file

I'm trying to split a file that contains repeated 'string = float' pairs. The file looks like this:

+name1 = 32    name2= 4
+name3 = 2     name4 = 5
+name5 = 2e+23
...  

And I want to put them in a dictionary, like this:

a={name1:32, name2:4, name3:2, name4:5, name5:2e+23}

I'm new to regular expressions and having a hard time figuring out what to do. After some googling, I tried the following to remove the "+" character and whitespace:

p=re.compile(r'[^+\s]+')
splitted_list=p.findall(lineof_file)

But this gave me two problems: 1. When there is no space between the name and the "=" sign, it doesn't split them. 2. For numbers like 2e+23, it splits at the + sign in the middle.
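A minimal reproduction of both problems, with two of the sample lines hard-coded (the names tokens1 and tokens2 are only for illustration):

```python
import re

# Same pattern as above: runs of characters that are neither '+' nor whitespace
p = re.compile(r'[^+\s]+')

tokens1 = p.findall("+name1 = 32    name2= 4")
tokens2 = p.findall("+name5 = 2e+23")

print(tokens1)  # 'name2=' stays glued together because no space precedes '='
print(tokens2)  # '2e+23' is cut in two at the '+'
```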

I managed to parse the file as I wanted after modifying depperm's code.
But I'm facing another problem. To explain it better, below is what my file can look like. After the + sign, multiple parameter/value pairs can appear, each joined by an '=' sign. A parameter name can contain letters and digits in any position. A value can carry a +/- sign and use scientific notation (E/e followed by an optional +/-). Sometimes a value is a math expression, in which case it is single-quoted.

+ abc2dfg3  = -2.3534E-03    dfe4c3= 2.000
+ abcdefg= '1.00232e-1*x' * bdfd=1e-3

I managed to parse the above using the below regex.

re.findall(r"(\w+)\s*=\s*([+-]?[\d+.Ee+-]+|'[^']+')",eachline)

But now my problem is that a line can contain a comment, as in "* bdfd=1e-3". Anything after an * (asterisk) in my file should be treated as a comment, unless the * appears inside a single-quoted string. The regex above also parses "bdfd=1e-3", but I want it to be skipped. I have been looking for a solution for hours without success.
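A short reproduction of the unwanted match, with the second sample line hard-coded (eachline and pairs are illustrative names):

```python
import re

eachline = "+ abcdefg= '1.00232e-1*x' * bdfd=1e-3"

# The regex from above, unchanged apart from the raw-string prefix
pairs = re.findall(r"(\w+)\s*=\s*([+-]?[\d+.Ee+-]+|'[^']+')", eachline)

# The pair after the comment marker '*' is picked up as well
print(pairs)
```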

Upvotes: 1

Views: 113

Answers (3)

Patrick Artner

Reputation: 51683

You can combine regex with string splitting:

Create the file:

t =""" 

+name1 = 32    name2= 4
+name3 = 2     name4 = 5
+name5 = 2e+23"""

fn = "t.txt"
with open(fn,"w") as f:
    f.write(t)

Split the file:

import re
d = {}
with open(fn,"r") as f:
    for line in f:    # process each line
        g = re.findall(r'(\w+ ?= ?[^ ]*)',line)    # find all name = something
        for hit in g:                              # something != space
            hit = hit.strip()                      # remove spaces
            if hit:
                key, val = hit.split("=")          # split and strip and convert  
                d[key.rstrip()] = float(val.strip())   # put into dict
print(d)

Output (Python 3.7+, where dicts keep insertion order):

{'name1': 32.0, 'name2': 4.0, 'name3': 2.0, 'name4': 5.0, 'name5': 2e+23}

Upvotes: 1

Kenny Alvizuris

Reputation: 445

You don't need a regular expression to accomplish your goal. You can use built-in Python methods.

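your_dictionary = {}
# Read the file
with open('file.txt', 'r') as fin:
    lines = fin.readlines()
# Iterate over each line
for line in lines:
    # Drop the leading '+' and pad '=' with spaces so that
    # "name2= 4" tokenizes the same way as "name1 = 32"
    splittedLine = line.lstrip('+ ').replace('=', ' = ').split()
    # Tokens now come in (name, '=', value) triples
    for i in range(0, len(splittedLine) - 2, 3):
        your_dictionary[splittedLine[i]] = float(splittedLine[i + 2])
print(your_dictionary)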

Hope it helps!

Upvotes: 1

depperm

Reputation: 10756

I would suggest just grabbing the name and the value instead of worrying about the spaces or unwanted characters. I'd use this regex: (name\d+)\s*=\s*([\de+]+). It captures the name in the first group and the number in the second, even if the number contains an e or a +.

import re
p=re.compile(r'(name\d+)\s*=\s*([\de+]+)')

a ={}
with open("file.txt", "r") as ins:
    for line in ins:
        splitted_list=p.findall(line)
        #splitted_list looks like: [('name1', '32'), ('name2', '4')]
        for group in splitted_list:
            a[group[0]]=group[1]
print(a)
#{'name1': '32', 'name2': '4', 'name3': '2', 'name4': '5', 'name5': '2e+23'}
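For the extended format in the question's update (arbitrary names, signed values, quoted expressions, and * comments), one possible approach is to cut off the comment first and only then look for pairs. The following is a rough sketch under those assumptions, not part of the code above: CODE_PART, PAIR, and parse_line are hypothetical names, and the sample lines are hard-coded instead of read from a file.

```python
import re

# Everything from the start of the line up to the first '*' that is
# not inside a single-quoted string.
CODE_PART = re.compile(r"(?:'[^']*'|[^'*])*")

# name = 'quoted expression'   or   name = number (optional sign/exponent)
PAIR = re.compile(
    r"(\w+)\s*=\s*('[^']*'|[+-]?(?:\d+\.?\d*|\.\d+)(?:[Ee][+-]?\d+)?)"
)

def parse_line(line):
    """Return {name: value} for one line, ignoring anything after a '*' comment."""
    code = CODE_PART.match(line).group(0)   # always matches, possibly empty
    result = {}
    for name, raw in PAIR.findall(code):
        if raw.startswith("'"):
            result[name] = raw[1:-1]        # keep the expression as text
        else:
            result[name] = float(raw)       # plain number
    return result

sample = [
    "+ abc2dfg3  = -2.3534E-03    dfe4c3= 2.000",
    "+ abcdefg= '1.00232e-1*x' * bdfd=1e-3",
]
params = {}
for line in sample:
    params.update(parse_line(line))
print(params)
```

Here bdfd is skipped because it follows an unquoted *, while the * inside '1.00232e-1*x' is kept. Quoted expressions are stored as strings without evaluation. A quoted expression that itself contains something like name=1 would still be searched by PAIR, so that case would need extra handling.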

Upvotes: 1
