user3685687

Reputation: 63

Extract Numeric Data from a Text file in Python

Say I have a text file with the data/string:

Dataset #1: X/Y= 5, Z=7 has been calculated
Dataset #2: X/Y= 6, Z=8 has been calculated
Dataset #10: X/Y =7, Z=9 has been calculated 

I want the output to be on a csv file as:

X/Y, X/Y, X/Y

Which should display:

5, 6, 7

Here is my current approach. I am using string.find, but it feels like a difficult way to solve this problem:

data = open('TestData.txt').read()
#index of string
counter = 1

if (data.find('X/Y=')==1):      
#extracts segment out of string
    line = data[r+6:r+14]
    r = data.find('X/Y=')
    counter += 1 
    print line
else: 
    r = data.find('X/Y')
    line = data[r+6:r+14]
    for x in range(0,counter):
    print line


print counter

Error: For some reason, I'm only getting the value 5. When I set up a loop, I get infinite 5's.

Upvotes: 2

Views: 4324

Answers (2)

Padraic Cunningham

Reputation: 180411

If you want the numbers and every line in your text file is formatted like the first two lines, i.e. X/Y= 6 rather than X/Y =7:

import re

result = []
with open("TestData.txt") as f:
    for line in f:
        # Lookbehind: match one or more digits preceded by "Y=" and one whitespace character "\s".
        s = re.search(r'(?<=Y=\s)\d+', line)
        if s:  # re.search returns None when there is no match
            result.append(s.group())
print(result)
['5', '6', '7']

To match the pattern in your comment, you should escape the period as \. or you will also match strings like 1.2+3, because an unescaped "." has a special meaning in a regex: it matches any character.

So re.search(r'(?<=Counting Numbers =\s)\d\.\d\.\d',s).group() will return only 1.2.3
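
A quick way to see the difference is sketched below. The two sample strings are made up for illustration, since the exact text from the comment is not shown here:

```python
import re

# Hypothetical sample strings (not taken from the original comment).
samples = ["1.2.3", "1.2+3"]

unescaped = re.compile(r"\d.\d.\d")   # "." matches any character
escaped = re.compile(r"\d\.\d\.\d")   # "\." matches a literal period only

unescaped_hits = [s for s in samples if unescaped.search(s)]
escaped_hits = [s for s in samples if escaped.search(s)]

print(unescaped_hits)  # both strings match
print(escaped_hits)    # only the string with literal periods matches
```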

If you want to be more explicit, you can use s = re.search(r'(?<=X/Y=\s)\d+', line), which spells out the full X/Y=\s prefix.

Using the original line in your comment and the updated line would return:

['5', '6', '7', '5', '5']

The (?<=Y=\s) is called a positive lookbehind assertion.

(?<=...)

Matches if the current position in the string is preceded by a match for ... that ends at the current position

There are lots of nice examples in the re documentation. The text matched inside the lookbehind parentheses is not included in the returned match.
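
One caveat, sketched below: Python's re module only accepts fixed-width lookbehinds, so the lookbehind cannot absorb the varying spaces around "=" seen in the third line of the question (X/Y =7). A capture group is one possible alternative. The capture pattern here is an assumption for illustration, not part of the answer above:

```python
import re

# The three sample lines from the question.
lines = [
    "Dataset #1: X/Y= 5, Z=7 has been calculated",
    "Dataset #2: X/Y= 6, Z=8 has been calculated",
    "Dataset #10: X/Y =7, Z=9 has been calculated",
]

lookbehind = re.compile(r"(?<=X/Y=\s)\d+")   # fixed-width prefix: "X/Y=" plus one whitespace
capture = re.compile(r"X/Y\s*=\s*(\d+)")     # assumed alternative: optional spaces, digits captured

lookbehind_values = [m.group() for m in map(lookbehind.search, lines) if m]
capture_values = [m.group(1) for m in map(capture.search, lines) if m]

# A variable-width lookbehind is rejected when the pattern is compiled.
try:
    re.compile(r"(?<=X/Y\s*=\s*)\d+")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

print(lookbehind_values)   # misses the "X/Y =7" line
print(capture_values)      # picks up all three lines
print(variable_width_ok)
```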

Upvotes: 3

khampson

Reputation: 15306

Since each entry appears to be on a single line, I would recommend using readline in a loop to read the file line by line, and then using a regex to parse out the components you're looking for from each line.

Edit re: OP's comment:

One regex pattern that could be used to capture the number given the specified format in this case would be: X/Y\s*=\s*(.+),
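
Putting the two suggestions together, a minimal sketch might look like the following. It assumes Python 3 and uses in-memory io.StringIO objects in place of the real input and output files, so it runs on its own; with real files you would use open("TestData.txt") and open(..., "w", newline="") instead (the output file name is up to you):

```python
import csv
import io
import re

# Stand-in for open("TestData.txt"); contains the sample lines from the question.
source = io.StringIO(
    "Dataset #1: X/Y= 5, Z=7 has been calculated\n"
    "Dataset #2: X/Y= 6, Z=8 has been calculated\n"
    "Dataset #10: X/Y =7, Z=9 has been calculated\n"
)

pattern = re.compile(r"X/Y\s*=\s*(.+),")

values = []
while True:
    line = source.readline()
    if not line:  # readline returns an empty string at end of file
        break
    match = pattern.search(line)
    if match:
        values.append(match.group(1))

# Stand-in for the CSV output file.
out = io.StringIO()
csv.writer(out).writerow(values)
csv_text = out.getvalue()

print(values)
print(csv_text)
```

Note that (.+) is greedy and runs up to the last comma on the line. That works here because each sample line contains exactly one comma; if a line could contain more, a stricter group such as (\d+) would be safer.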

Upvotes: 1
