jarpineh

Reputation: 81

Reading a variable column and row structure into Pandas based on column count

I need to create a Pandas DataFrame from a large file of space-delimited values whose row structure depends on the number of columns.

Raw data looks like this:

2008231.0 4891866.0 383842.0 2036693.0 4924388.0 375170.0

The values may sit on one line or be spread over several; line breaks are ignored.

The end result should look like this if the number of columns is three:

[(u'2008231.0', u'4891866.0', u'383842.0'),
(u'2036693.0', u'4924388.0', u'375170.0')]

Splitting the file into rows depends on the number of columns, which is stated in the meta part of the file.

Currently I read the whole file into one big list of values and then chunk it into rows:

from itertools import izip_longest  # itertools.zip_longest in Python 3

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

(the grouper code is from the itertools recipes)
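
For context, my current usage looks roughly like this (the file name and n_cols are placeholders here; the real column count comes from the meta part of the file):

import pandas as pd

n_cols = 3  # placeholder; in reality this is read from the meta part of the file

with open('raw_data.csv') as f:
    values = f.read().split()           # one flat list of every value in the file

rows = list(grouper(n_cols, values))    # grouper() as defined above
df = pd.DataFrame.from_records(rows)    # flat list, row list and DataFrame all coexist in memory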

The problem is that I end up with multiple copies of the data in memory. With 500 MB+ files this eats up memory fast, and Pandas has some trouble reading lists this big with large MultiIndexes.

How can I use Pandas file reading functionality (read_csv, read_table, read_fwf) with this kind of data?

Or is there another way of reading data into Pandas without auxiliary data structures?

Upvotes: 1

Views: 1777

Answers (2)

unutbu

Reputation: 880369

Although it is possible to create a custom file-like object, this will be very slow compared to the normal usage of pd.read_table:

import pandas as pd
import re

filename = 'raw_data.csv'

class FileLike(file):
    """ Modeled after FileWrapper
    http://stackoverflow.com/a/14279543/190597 (Thorsten Kranz)

    Subclasses Python 2's built-in file; each physical line is split into
    chunks of three whitespace-separated values, handed out one per next() call.
    """
    def __init__(self, *args):
        super(FileLike, self).__init__(*args)
        self.buffer = []
    def next(self):
        while not self.buffer:
            line = super(FileLike, self).next()   # raises StopIteration at EOF
            self.buffer = re.findall(r'(\S+\s+\S+\s+\S+)', line)
        return self.buffer.pop(0)                 # pop from the front to preserve row order

with FileLike(filename, 'r') as f:
    df = pd.read_table(f, header=None, delimiter=r'\s+')
    print(len(df))

When I try using FileLike on a 5.8 MB file of 200,000 lines, the code above takes about 3.9 seconds to run.

If I instead preprocess the data (splitting each line into 2 lines and writing the result to disk):

import fileinput
import re

filename = 'raw_data.csv'
# inplace=True redirects stdout into the file, so each print() rewrites it in place;
# the original is kept as raw_data.csv.bak
for line in fileinput.input([filename], inplace=True, backup='.bak'):
    for part in re.findall(r'(\S+\s+\S+\s+\S+)', line):
        print(part)

then you can of course load the data normally into Pandas using pd.read_table:

with open(filename, 'r') as f:
    df = pd.read_table(f, header=None, delimiter=r'\s+')
    print(len(df))

Rewriting the file took ~0.6 seconds, and loading the DataFrame then took only ~0.7 seconds.

So, it appears you will be better off rewriting your data to disk first.

Upvotes: 1

Andy Hayden

Reputation: 375695

I don't think there is a way to separate rows with the same delimiter as columns.

One way around this is to read everything into a single Series with read_csv and then reshape it (the reshape will most likely be a copy rather than a view, in order to keep the data contiguous):

s = pd.read_csv(file_name, lineterminator=' ', header=None, squeeze=True)
df = pd.DataFrame(s.values.reshape(len(s) // n, n))

In your example:

In [1]: s = pd.read_csv('raw_data.csv', lineterminator=' ', header=None, squeeze=True)

In [2]: s
Out[2]: 
0    2008231
1    4891866
2     383842
3    2036693
4    4924388
5     375170
Name: 0, dtype: float64

In [3]: pd.DataFrame(s.values.reshape(len(s)/3, 3))
Out[3]: 
         0        1       2
0  2008231  4891866  383842
1  2036693  4924388  375170
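
And if you want the list of row tuples from the question rather than a DataFrame, the same reshape gets you there; here's a small sketch building on the Series above (note the values end up as floats, not the unicode strings shown in the question):

import pandas as pd

n = 3  # number of columns, taken from the meta part of the file
s = pd.read_csv('raw_data.csv', lineterminator=' ', header=None, squeeze=True)
rows = [tuple(row) for row in s.values.reshape(-1, n)]  # -1 lets numpy infer the row count
# [(2008231.0, 4891866.0, 383842.0), (2036693.0, 4924388.0, 375170.0)]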

Upvotes: 0
