Adrian Z.

Reputation: 934

How to split file into chunks by string delimiter in Python

I need to upload a potentially large CSV file into my application. Each section of the file begins with a line of the form #TYPE *. How should I split the file into chunks and then process each chunk? Each chunk is a list of headers followed by all the values.

So far I have written the processing for a single chunk, but I'm not sure how to apply it to every chunk. I think a regex would be the best option because the #TYPE * line recurs at the start of every section.

#TYPE Lorem.Text.A
...
#TYPE Lorem.Text.B
...
#TYPE Lorem.Text.C
...

UPDATE

The solution has since changed: instead of saving all sections in one file, each section is now saved to a separate file and the files are bundled into a zip archive. Python then reads that zip file for further analysis. If anyone is interested in the details, message me and I'll update this question.
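A minimal sketch of what such a setup could look like, using the standard zipfile module. This is not the code actually used: the function names and the section_N.csv member naming scheme are assumptions made purely for illustration.

```python
import zipfile


def write_sections_to_zip(sections, zip_path):
    """Write each section (an iterable of lines) to its own member of a zip file."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for index, sec in enumerate(sections):
            # Member names like section_0.csv are an arbitrary, hypothetical choice.
            zf.writestr(f"section_{index}.csv", "".join(sec))


def read_sections_from_zip(zip_path):
    """Yield (member name, list of lines) for every member of the zip file."""
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            # Assumes the members were written as UTF-8 text.
            text = zf.read(name).decode("utf-8")
            yield name, text.splitlines(keepends=True)
```

Each member can then be analyzed on its own, without loading the other sections.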

The answer from @Padraic was the most helpful for the original approach.

Upvotes: 3

Views: 3516

Answers (2)

Padraic Cunningham

Reputation: 180391

You could use itertools.groupby, presuming the sections are delimited by lines starting with #TYPE:

from itertools import groupby, chain


def get_sections(fle):
    with open(fle) as f:
        grps = groupby(f, key=lambda x: x.lstrip().startswith("#TYPE"))
        for k, v in grps:
            if k:
                yield chain([next(v)], (next(grps)[1]))  # all lines up to next #TYPE

You can get each section as you iterate:

In [13]: cat in.txt
#TYPE Lorem.Text.A
first
#TYPE Lorem.Text.B
second
#TYPE Lorem.Text.C
third

In [14]: for sec in get_sections("in.txt"):
   ....:     print(list(sec))
   ....:     
['#TYPE Lorem.Text.A\n', 'first\n']
['#TYPE Lorem.Text.B\n', 'second\n']
['#TYPE Lorem.Text.C\n', 'third\n']

If no other lines start with #, then "#" alone is enough as the argument to startswith. There is nothing complicated in your pattern, so this is not really a use case for a regex. This approach also keeps at most one section in memory at a time rather than the whole file.
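To connect this with the per-chunk processing mentioned in the question, here is a minimal sketch of how one section could be turned into rows with the csv module. The helper name parse_section is hypothetical, and it assumes the line right after #TYPE is the header row and that the #TYPE line always carries a name.

```python
import csv


def parse_section(sec):
    """Turn one section (an iterable of lines) into (type name, list of row dicts)."""
    lines = iter(sec)
    # First line looks like "#TYPE Lorem.Text.A"; keep only the part after "#TYPE".
    type_name = next(lines).strip().split(None, 1)[1]
    reader = csv.reader(lines)
    # Assumption: the first remaining line is the header row (empty if missing).
    headers = next(reader, [])
    rows = [dict(zip(headers, row)) for row in reader]
    return type_name, rows


# Usage with the generator above (each section is consumed before the next one):
# for sec in get_sections("in.txt"):
#     type_name, rows = parse_section(sec)
```

Because parse_section reads its section to the end before returning, it is safe to use inside the loop over get_sections, which hands out lazy iterators.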

If you have no leading whitespace and the only place # appears is before TYPE, it may be sufficient to group on startswith("#") alone:

from itertools import groupby, chain


def get_sections(fle):
    with open(fle) as f:
        # the key is needed: without it groupby would group identical lines
        grps = groupby(f, key=lambda x: x.startswith("#"))
        for k, v in grps:
            if k:
                yield chain([next(v)], (next(grps)[1]))  # all lines up to next #TYPE

If there is some metadata at the start, you could use dropwhile to skip lines until we hit the first #TYPE and then just group:

from itertools import groupby, chain, dropwhile


def get_sections(fle):
    with open(fle) as f:
        # skip everything before the first line starting with "#"
        body = dropwhile(lambda x: not x.startswith("#"), f)
        grps = groupby(body, key=lambda x: x.startswith("#"))
        for k, v in grps:
            if k:
                yield chain([next(v)], (next(grps)[1]))  # all lines up to next #TYPE

Demo:

In [16]: cat in.txt
meta
more meta
#TYPE Lorem.Text.A
first
#TYPE Lorem.Text.B
second
second
#TYPE Lorem.Text.C
third

In [17]: for sec in get_sections("in.txt"):
   ....:     print(list(sec))
   ....:     
['#TYPE Lorem.Text.A\n', 'first\n']
['#TYPE Lorem.Text.B\n', 'second\n', 'second\n']
['#TYPE Lorem.Text.C\n', 'third\n']
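One caveat, as far as I can tell: the groupby versions assume every #TYPE line is followed by at least one data line. With two #TYPE lines in a row the second header is lost, and a #TYPE line at the very end of the file makes next(grps) fail. If that can happen in your data, a plain loop is a possible alternative. This is a different technique from the groupby approach above, and the name iter_sections is just an illustrative choice; it builds each section as a list, so it still holds only one section in memory at a time.

```python
def iter_sections(lines):
    """Yield each section as a list of lines, starting at its #TYPE line."""
    section = None  # stays None until the first #TYPE line, so leading metadata is skipped
    for line in lines:
        if line.lstrip().startswith("#TYPE"):
            if section is not None:
                yield section  # previous section is complete
            section = [line]
        elif section is not None:
            section.append(line)
    if section is not None:
        yield section  # last section, even if it has no data lines


# Usage:
# with open("in.txt") as f:
#     for sec in iter_sections(f):
#         print(sec)
```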

Upvotes: 4

Avinash Raj

Reputation: 174696

Split on the newline character that comes right before each #TYPE (here f is the already-opened file object):

chunks = re.split(r'\n(?=#TYPE\b *)', f.read())

Example:

>>> import re
>>> s = '''#TYPE Lorem.Text.A
...
#TYPE Lorem.Text.B
...
#TYPE Lorem.Text.C
...'''
>>> re.split(r'\n(?=#TYPE *)', s)
['#TYPE Lorem.Text.A\n...', '#TYPE Lorem.Text.B\n...', '#TYPE Lorem.Text.C\n...']
>>> 
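Each chunk returned by re.split still contains its #TYPE line, so a further step is needed to separate the type name from the data. A minimal sketch, assuming the whole text fits in memory; split_chunks and chunk_to_pair are hypothetical helper names, not part of the re module.

```python
import re


def split_chunks(text):
    # Split on each newline that is immediately followed by "#TYPE".
    return re.split(r'\n(?=#TYPE\b)', text)


def chunk_to_pair(chunk):
    # The first line is the "#TYPE <name>" header, the rest is the chunk body.
    header, _, body = chunk.partition('\n')
    type_name = header[len('#TYPE'):].strip()
    return type_name, body.splitlines()


# Usage:
# with open("in.txt") as f:
#     pairs = [chunk_to_pair(c) for c in split_chunks(f.read())]
```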

Upvotes: -1
