Reputation: 934
I'm going to need to upload a potentially large CSV file into my application. Each section of that file is indicated by a #TYPE * line. How should I go about splitting it into chunks and doing further processing on each chunk? Each chunk is a list of headers followed by all the values.
Right now I have written the processing for a single chunk, but I'm not sure how to run that operation for each chunk. I think a regex operation would be the best option because of the recurring #TYPE * marker.
#TYPE Lorem.Text.A
...
#TYPE Lorem.Text.B
...
#TYPE Lorem.Text.C
...
UPDATE
This solution has been changed from saving all sections in one file to saving each section to a separate file and zipping them into one zip archive. That zip file is then read by Python and analyzed further. If anyone is interested in that explanation, message me and I'll update this question.
The answer from @Padraic was the most helpful for the old approach.
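For anyone curious, a minimal sketch of that per-section zip approach might look like this (the splitting loop and the member-name scheme are illustrative, not the exact code I use):

import io
import zipfile

def sections_to_zip(text, zip_file):
    """Split text on lines starting with #TYPE and store each
    section as a separate member of a zip archive."""
    sections = []
    for line in io.StringIO(text):
        if line.startswith("#TYPE"):
            sections.append([line])   # start a new section
        elif sections:
            sections[-1].append(line)  # body line of the current section
    with zipfile.ZipFile(zip_file, "w") as zf:
        for sec in sections:
            # member name derived from the #TYPE header line, e.g. Lorem.Text.A.csv
            name = sec[0].split()[1] + ".csv"
            zf.writestr(name, "".join(sec))

The archive can then be opened with zipfile.ZipFile on the analysis side and each member read back independently.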
Upvotes: 3
Views: 3516
Reputation: 180391
You could use a groupby, presuming the sections are delimited by lines starting with #TYPE:
from itertools import groupby, chain

def get_sections(fle):
    with open(fle) as f:
        grps = groupby(f, key=lambda x: x.lstrip().startswith("#TYPE"))
        for k, v in grps:
            if k:
                yield chain([next(v)], (next(grps)[1]))  # all lines up to next #TYPE
You can get each section as you iterate:
In [13]: cat in.txt
#TYPE Lorem.Text.A
first
#TYPE Lorem.Text.B
second
#TYPE Lorem.Text.C
third
In [14]: for sec in get_sections("in.txt"):
....: print(list(sec))
....:
['#TYPE Lorem.Text.A\n', 'first\n']
['#TYPE Lorem.Text.B\n', 'second\n']
['#TYPE Lorem.Text.C\n', 'third\n']
If no other lines start with #, then that alone will be enough to use in startswith; there is nothing complicated in your pattern, so it is not really a use case for a regex. This also stores only one section at a time in memory, not the whole file.
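Since each chunk is a list of headers followed by values, once you have a section iterator you can hand the remainder straight to csv.reader. A sketch building on the generator above (parse_section is a made-up helper name, not part of the original answer):

import csv
from itertools import groupby, chain

def get_sections(fle):
    # generator from above: yields one section (marker line + body) at a time
    with open(fle) as f:
        grps = groupby(f, key=lambda x: x.lstrip().startswith("#TYPE"))
        for k, v in grps:
            if k:
                yield chain([next(v)], (next(grps)[1]))

def parse_section(section):
    # first line is the #TYPE marker, the next line the csv header,
    # and the rest are the value rows
    type_line = next(section).strip()
    rows = csv.reader(section)
    header = next(rows)
    return type_line, header, list(rows)

Note that each yielded section must be fully consumed before advancing to the next one, since groupby shares the underlying file iterator.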
If you have no leading whitespace and the only place # appears is before TYPE, it is sufficient to key the groupby on that prefix alone:
from itertools import groupby, chain

def get_sections(fle):
    with open(fle) as f:
        grps = groupby(f, key=lambda x: x.startswith("#"))
        for k, v in grps:
            if k:
                yield chain([next(v)], (next(grps)[1]))  # all lines up to next #TYPE
If there were some metadata at the start, you could use dropwhile to skip lines until we hit the #TYPE and then group:
from itertools import groupby, chain, dropwhile

def get_sections(fle):
    with open(fle) as f:
        grps = groupby(dropwhile(lambda x: not x.startswith("#"), f),
                       key=lambda x: x.startswith("#"))
        for k, v in grps:
            if k:
                yield chain([next(v)], (next(grps)[1]))  # all lines up to next #TYPE
Demo:
In [16]: cat in.txt
meta
more meta
#TYPE Lorem.Text.A
first
#TYPE Lorem.Text.B
second
second
#TYPE Lorem.Text.C
third
In [17]: for sec in get_sections("in.txt"):
   ....:     print(list(sec))
   ....:
['#TYPE Lorem.Text.A\n', 'first\n']
['#TYPE Lorem.Text.B\n', 'second\n', 'second\n']
['#TYPE Lorem.Text.C\n', 'third\n']
Upvotes: 4
Reputation: 174696
Split on each newline character that is immediately followed by #TYPE, using a lookahead:
chunks = re.split(r'\n(?=#TYPE\b *)', f.read())
Example:
>>> import re
>>> s = '''#TYPE Lorem.Text.A
...
#TYPE Lorem.Text.B
...
#TYPE Lorem.Text.C
...'''
>>> re.split(r'\n(?=#TYPE *)', s)
['#TYPE Lorem.Text.A\n...', '#TYPE Lorem.Text.B\n...', '#TYPE Lorem.Text.C\n...']
>>>
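One caveat for a potentially large file: f.read() pulls the entire file into memory before splitting, unlike the groupby approach. Once split, the first line of each chunk carries the section name, which can be pulled out with another small regex (the pattern below is illustrative):

import re

s = "#TYPE Lorem.Text.A\n...\n#TYPE Lorem.Text.B\n...\n#TYPE Lorem.Text.C\n..."

# split on newlines immediately followed by a #TYPE marker
chunks = re.split(r'\n(?=#TYPE *)', s)

# extract the section name from the first line of each chunk
names = [re.match(r'#TYPE\s+(\S+)', c).group(1) for c in chunks]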
Upvotes: -1