Shayan

Reputation: 548

Loop in a REGEX search

I have some .csv files that need to be parsed, and I am stuck on one part that needs to be split into separate rows. To make it clear:

This is the sample CSV:

004  000000,Y 
005  000000,N 
006  000000,N
007 A000000,Y
007 B000000,16
007 C010100,1
007 C020100,XTF ADVISORS TRUST - ETF 2010 PORTFOLIO
007 C030100,Y
007 C010200,2
007 C020200,XTF ADVISORS TRUST - ETF 2015 PORTFOLIO
007 C030200,Y
007 C010300,3
007 C020300,XTF ADVISORS TRUST - ETF 2020 PORTFOLIO
007 C030300,Y
007 C010400,4
007 C020400,XTF ADVISORS TRUST - ETF 2025 PORTFOLIO
007 C030400,Y
007 C010500,5
007 C020500,XTF ADVISORS TRUST - ETF 2030 PORTFOLIO
007 C030500,Y
007 C010600,6

The Python code for this part, which returns the line number of 007 A000000 and the number of sections, is:

import csv

def haveSeries(csvfile):
    with open(csvfile, 'rb') as f:
        reader = csv.reader(f)
        row2 = 0
        for row in reader:
            if (row[0] == '007 A000000') and (row[1]=='Y'):
                baseline = reader.line_num
                print baseline
                seriesnum = reader.next()
                print seriesnum[1]
                return (baseline,seriesnum[1])

It returns 16 for the example above, so there are 16 categories. Now I need to write another CSV in which every row starts with all the [key, value] pairs up to [007 A000000,Y], followed in the next columns by the data for one category. The categories are numbered inside the keys, like this:

086 D020000,0
086 E010000,0
086 E020000,0
086 F010000,0
086 F020000,0
024  000100,N
025 D000101,0
025 D000102,0
025 D000103,0
025 D000104,0
025 D000105,0
025 D000106,0
025 D000107,0
***... Category 1 starts at 024 000100 ...***
075 A000100,0
075 B000100,0
076  000100,0.00
024  000200,N
025 D000201,0
025 D000202,0
025 D000203,0
025 D000204,0
025 D000205,0
***... category 2 starts at 024 000200... and so on***

So the regex to identify these would be something like \d{3}( \w|  )\d{3}X\d.{,}, where I have to iterate X from 1 to 16 and produce a separate row for each category.

The code that I wrote for this part:

if haveSeries(csvfile) != False:
    seriesBaseNNum = haveSeries(csvfile)
    # TODO write all the lines from 1 to baseline again
    for row in reader:
        for i in xrange(1, int(seriesBaseNNum[1])):
            i = u'%02d' % i  # two digits
            seriesi = re.compile("\d{3}( \w|  )\d{3}%s\d.{,}" % i)  # err on %d so changed to %s
            matchers = seriesi.search(row[0])
            if matchers:
                print matchers.group(0)

but I get an output like this:

074 T000100
074 U010100
074 U020100
074 V010100
074 V020100
074 W000100
074 X000100
074 Y000100
075 A000100
075 B000100
076  000100
024  001100
025 D001101
025 D001102
025 D001103
025 D001104
025 D001105
025 D001106
025 D001107
025 D001108
028 A011100
028 A021100
028 A031100
028 A041100
028 B011100
028 B021100
028 B031100
028 B041100
028 C011100
028 C021100
...

So it only iterates once, for i=1 (and, by chance, i=11, that is, when the %s is 1 and the character before it is also 1).

  1. How can I iterate with the regex to find all the matches for i=1 to 16 in this example?
  2. How should I implement the part that writes the first n columns for all categories and writes the rest in the following columns of each row?

Upvotes: 0

Views: 203

Answers (2)

Shayan

Reputation: 548

The problem was with the regex. The line that builds the pattern should have been:

seriesi = re.compile ("\d{3}( \w|  )\d{2}%02d\d{2}.{,}" %i)

Then I use one for loop, from one to the number of categories, to write lines 1 to seriesBaseNNum[0], and another to write each category's values into its row in the CSV.
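A minimal sketch of how that could look, assuming the six digits in each key read as item (2 digits), category (2 digits), sub-item (2 digits). The names category_pattern and build_rows are hypothetical, the pattern is anchored with ^ and $ (an addition not in the line above), and the code avoids print so it runs on Python 3 as well as Python 2. Note that the upper bound of the range is inclusive here; xrange(1, n) in the question would stop at n-1.

```python
import re


def category_pattern(i):
    """Compile the key pattern for category i, zero-padded to two digits."""
    return re.compile(r"^\d{3}( \w|  )\d{2}%02d\d{2}$" % i)


def build_rows(rows, baseline, n_categories):
    """Return one output row per category.

    rows: list of (key, value) pairs in file order.
    baseline: number of leading pairs shared by every category.
    """
    # Values that are repeated at the start of every output row.
    shared = [value for _key, value in rows[:baseline]]
    rest = rows[baseline:]
    out = []
    for i in range(1, n_categories + 1):  # inclusive upper bound
        pattern = category_pattern(i)
        # Values whose key carries this category number.
        values = [value for key, value in rest if pattern.match(key)]
        out.append(shared + values)
    return out
```

Each returned list can then be passed to csv.writer(...).writerow. Because the category is padded to two digits and pinned to the middle pair, the pattern for category 1 no longer matches a key such as 024  001100, which belongs to category 11.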

Thanks for the help, though.

Upvotes: 0

Julio

Reputation: 2290

Your matchers variable is a match object. Per the documentation, you can access the matched text via its group() method.

>>> import re
>>> m = re.search('(?<=abc)def', 'abcdef')
>>> m.group(0)
'def'
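
Applied to the keys in the question, a capturing group could also return the category number directly instead of testing one pattern per category. This is only a sketch: KEY_RE and describe are hypothetical names, and it assumes the middle two of the six digits hold the category.

```python
import re

# Group 1 is the " X" or two-space separator, group 2 the category digits.
KEY_RE = re.compile(r"\d{3}( \w|  )\d{2}(\d{2})\d{2}")


def describe(key):
    """Return the whole match and the category number, or None if no match."""
    m = KEY_RE.search(key)
    if m is None:  # search() returns None when nothing matches
        return None
    return {"whole": m.group(0), "category": int(m.group(2))}
```

Here group(0) is the entire matched key, while group(2) is only the text captured by the second pair of parentheses.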

Upvotes: 2
