Storm
Storm

Reputation: 25

Python (pandas): How to read data by chunks in a long file where chunks are separated by a header and are not equal in length?

I have the following data in a single txt file:

00001+++00001   000031         12.8600    -1    7     BEAR

     1990052418  276.00  18.80     0.00  12.86

     1990052500  276.00  19.70     0.00  12.86

00002+++00002   000047         30.8700   995   22     LION

     1990072206  318.10   8.80  1010.00  12.86

     1990072212  316.80   8.90  1010.00  12.86

     1990072218  315.40   9.30  1010.00  12.86

     1990072300  313.60   9.30  1010.00  12.86

00003+++00003   000050         36.0100   973   37      BIRD

     1990072412  285.00  34.00  1012.00  10.29

     1990072418  286.20  33.60  1013.00  10.29

     1990072500  287.30  33.00  1013.00  10.29

How can I read each (unequal) chunk under each header separately in Python (Pandas)?

Upvotes: 1

Views: 630

Answers (1)

Andy Hayden
Andy Hayden

Reputation: 375675

Here's a solution which parses each block (without reading in the entire file):

def read_fwfs(text_file):
    res = {}
    with open(text_file) as f:
        header = f.readline()
        block = []
        for line in f:
            if "+++" in line:
                #  End of block, so save it to dict as DataFrame.
                #  Note: this next line could be pulled out as a function.
                res[header.split("+++")[0]] = pd.read_fwf(StringIO("".join(block)), header=None)
                #  Reset variables.
                header = line
                block = []
            #  Ignore blank lines.
            elif line != "\n":
                block.append(line)
        #  Save the last block.
        #  Note: See what I mean about being a function? Here it is again:
        res[header.split("+++")[0]] = pd.read_fwf(StringIO("".join(block)), header=None)
        return res

In [11]: d = read_fwfs("my_text_file.txt")
         d
Out[11]:
{'00001':             0    1     2  3      4
          0  1990052418  276  18.8  0  12.86
          1  1990052500  276  19.7  0  12.86,
 '00002':             0      1    2     3      4
          0  1990072206  318.1  8.8  1010  12.86
          1  1990072212  316.8  8.9  1010  12.86
          2  1990072218  315.4  9.3  1010  12.86
          3  1990072300  313.6  9.3  1010  12.86,
 '00003':             0      1     2     3      4
          0  1990072412  285.0  34.0  1012  10.29
          1  1990072418  286.2  33.6  1013  10.29
          2  1990072500  287.3  33.0  1013  10.29}

In [12]: d["00003"]
Out[12]:
            0      1     2     3      4
0  1990072412  285.0  34.0  1012  10.29
1  1990072418  286.2  33.6  1013  10.29
2  1990072500  287.3  33.0  1013  10.29

If you want to do something more with the splitting line, e.g. use it as a header, you could add that into the read_fwf part e.g. by adding the second half of the split header to the joined block. As I mention inline, it's probably a good idea to pull that part out as a function anyway.

Upvotes: 2

Related Questions