Ignacio Garcia

Reputation: 73

Splitting contents in textfile

I have a text file that contains the following:

Number1 (E, P) (F, H)
Number2 (A, B) (C, D)
Number3 (I, J) (O, Z) 

I know more or less how to read the file and get its values into my program, but I want to know how to correctly split each line into "Number1", "(E, P)" and "(F, H)". Later, I also want to be able to check in my program whether "Number1" contains "(E, P)" or not.

def read_srg(name):
    filename = name + '.txt'
    fp = open(filename)
    lines = fp.readlines()

    R = {}
    for line in lines:
        ??? = line.split()

    fp.close()

    return R

Upvotes: 0

Views: 73

Answers (3)

mapofemergence

Reputation: 458

Because of the whitespace inside the parentheses, you are better off using a regular expression than just splitting lines.

Here's your read_srg function, with the regex check integrated:

import re

def read_srg(name):
    with open('%s.txt' % (name, ), 'r') as text:
        matchstring = r'(Number[0-9]+) (\([A-Z,\s]+\)) (\([A-Z,\s]+\))'
        R = {}
        for i, line in enumerate(text):
            match = re.match(matchstring, line)
            if not match:
                print('skipping exception found in line %d: %s' % (i + 1, line))
                continue
            key, v1, v2 = match.groups()
            R[key] = v1, v2
        return R

from pprint import pformat
print(pformat(read_srg('example')))

To read your dictionary and perform checks on keys and values, you can later do something like:

test_dict = read_srg('example')
for key, (v1, v2) in test_dict.items():
    matchstring = ''
    if 'Number1' in key and '(E, P)' in v1:
        matchstring = 'match found: '
    print('%s%s > %s %s' % (matchstring, key, v1, v2))

A big advantage of this approach is that you can also use your regex to check that your file isn't malformed for some reason. This is why the matching rule is quite strict:

matchstring = r'(Number[0-9]+) (\([A-Z,\s]+\)) (\([A-Z,\s]+\))'

  • (Number[0-9]+) will match only words made of Number followed by one or more digits
  • (\([A-Z,\s]+\)) will match only strings enclosed in () that contain nothing but capital letters, commas, or whitespace

I read in your comment that the format of the file is always the same, so I'm assuming it is procedurally generated. Still, you might want to check its integrity (or make sure that your code does not break if the procedure generating the txt file ever changes its formatting). Depending on how strict you want your sanity check to be, you can push the above even further:

  • if you know there should never be more than 3 digits after Number, you might change (Number[0-9]+) to (Number[0-9]{1,3}) (which limits the match to 1, 2 or 3 digits)
  • if you want to be sure the format in parentheses is always two single capital letters split by ", " you can change (\([A-Z,\s]+\)) to (\([A-Z], [A-Z]\))
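To illustrate the two stricter rules above combined, here is a minimal sketch. The names STRICT and parse_line are only placeholders for this example, and the sample strings are made up to show one accepted and two rejected lines. Note that re.match only anchors at the start of the line, so trailing text after the second pair would still be accepted.

```python
import re

# Stricter variant of the pattern: 1 to 3 digits after "Number",
# and exactly two single capital letters separated by ", " in each pair.
STRICT = re.compile(r'(Number[0-9]{1,3}) (\([A-Z], [A-Z]\)) (\([A-Z], [A-Z]\))')

def parse_line(line):
    """Return (key, v1, v2) for a well-formed line, or None otherwise."""
    match = STRICT.match(line)
    return match.groups() if match else None

good = parse_line('Number1 (E, P) (F, H)')        # well-formed line
bad_digits = parse_line('Number1234 (E, P) (F, H)')  # too many digits
bad_pair = parse_line('Number1 (E,P) (F, H)')     # missing space after the comma
```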

Upvotes: 1

Tim Pietzcker

Reputation: 336108

I think the easiest/most reliable way would be to use a regex:

import re
regex = re.compile(r"([^()]*) (\([^()]*\)) (\([^()]*\))")
with open("myfile.txt") as text:
    for line in text:
        contents = regex.match(line)
        if contents:
            label, g1, g2 = contents.groups()
            # now do something with these values, e.g. add them to a list

Explanation:

([^()]*)      # Match any number of characters besides parentheses --> group 1
[ ]           # Match a space
(\([^()]*\))  # Match (, then any non-parenthesis characters, then ) --> group 2
[ ]           # Match a space
(\([^()]*\))  # Match (, then any non-parenthesis characters, then ) --> group 3
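As a rough sketch of how this could cover the second part of the question (checking whether "Number1" contains "(E, P)"), the same regex can fill a dictionary. The sample list below stands in for the lines of the file, and the names table and has_pair are just placeholders for this example.

```python
import re

regex = re.compile(r"([^()]*) (\([^()]*\)) (\([^()]*\))")

# Stand-in for the lines of the text file from the question.
sample = [
    "Number1 (E, P) (F, H)",
    "Number2 (A, B) (C, D)",
    "Number3 (I, J) (O, Z)",
]

table = {}
for line in sample:
    contents = regex.match(line)
    if contents:
        label, g1, g2 = contents.groups()
        table[label] = (g1, g2)  # keep both pairs under their label

# Membership check: is "(E, P)" one of the pairs stored for "Number1"?
has_pair = "(E, P)" in table["Number1"]
```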

Upvotes: 6

Ma0

Reputation: 15204

You were really close. Try this:

def read_srg(name):    
    with open(name + '.txt', 'r') as f:
        R = {}
        for line in f:
            line = line.replace(', ', ',')  # Number1 (E, P) (F, H) -> Number1 (E,P) (F,H)
            header, *contents = line.strip().split()  # `header` gets the first item of the list and all the rest go to `contents`
            R[header] = contents
    return R

Checking for membership can be later done like so:

R = read_srg('example')  # 'example' is a placeholder file name
if "(E,P)" in R["Number1"]:
    pass  # do stuff

I did not test this but it should be fine. Let me know if anything comes up.
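For what it's worth, here is a small sketch of just the replace-and-split step on one sample line from the question, without reading a file. It needs Python 3 because of the starred assignment. The stored pairs no longer contain the space after the comma, so later checks have to use "(E,P)" rather than "(E, P)".

```python
# One sample line as it would come out of the file, newline included.
line = "Number1 (E, P) (F, H)\n"

# Remove the space after each comma so that split() keeps each pair together.
line = line.replace(', ', ',')

# The first token becomes the header, the remaining tokens go into a list.
header, *contents = line.strip().split()

found_without_space = "(E,P)" in contents
found_with_space = "(E, P)" in contents
```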

Upvotes: 0
