Reputation: 81
I need to create a Pandas DataFrame from a large file with space-delimited values and a row structure that depends on the number of columns.
Raw data looks like this:
2008231.0 4891866.0 383842.0 2036693.0 4924388.0 375170.0
The values may be on one line or spread over several; line breaks are ignored.
The end result should look like this, if the number of columns is three:
[(u'2008231.0', u'4891866.0', u'383842.0'),
(u'2036693.0', u'4924388.0', u'375170.0')]
Splitting the file into rows depends on the number of columns, which is stated in the meta part of the file.
Currently I split the file into one big list and then group it into rows:
from itertools import izip_longest  # itertools.zip_longest on Python 3

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)
(the code is from the itertools recipes)
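Roughly, the whole flow looks like this (a sketch using the grouper above; the file name and n = 3 are just placeholders, n really comes from the meta part):

import pandas as pd

n = 3  # number of columns, read from the meta part of the file
with open('raw_data.csv') as f:
    values = f.read().split()        # one big flat list of strings
rows = list(grouper(n, values))      # a second copy, now as n-tuples
df = pd.DataFrame(rows)              # and a third copy inside Pandas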
The problem is that I end up with multiple copies of the data in memory. With 500 MB+ files this eats up memory fast, and Pandas has trouble reading lists this big with large MultiIndexes.
How can I use Pandas file reading functionality (read_csv, read_table, read_fwf) with this kind of data?
Or is there another way of reading the data into Pandas without auxiliary data structures?
Upvotes: 1
Views: 1777
Reputation: 880369
Although it is possible to create a custom file-like object, this will be very slow compared to the normal usage of pd.read_table:
import pandas as pd
import re

filename = 'raw_data.csv'

class FileLike(file):
    """ Modeled after FileWrapper
    http://stackoverflow.com/a/14279543/190597 (Thorsten Kranz)
    """
    def __init__(self, *args):
        super(FileLike, self).__init__(*args)
        self.buffer = []

    def next(self):
        # Refill the buffer with the three-column chunks found on the next
        # physical line, then hand them out one at a time.
        if not self.buffer:
            line = super(FileLike, self).next()
            self.buffer = re.findall(r'(\S+\s+\S+\s+\S+)', line)
        if self.buffer:
            line = self.buffer.pop(0)  # pop from the front to keep the rows in order
        return line

with FileLike(filename, 'r') as f:
    df = pd.read_table(f, header=None, delimiter='\s+')

print(len(df))
When I try using FileLike on a 5.8 MB file (consisting of 200,000 lines), the above code takes 3.9 seconds to run.
If I instead preprocess the data (splitting each line into 2 lines and writing the result to disk):
import fileinput
import re

filename = 'raw_data.csv'

for line in fileinput.input([filename], inplace=True, backup='.bak'):
    for part in re.findall(r'(\S+\s+\S+\s+\S+)', line):
        print(part)
then you can, of course, load the data normally into Pandas using pd.read_table:
with open(filename, 'r') as f:
    df = pd.read_table(f, header=None, delimiter='\s+')

print(len(df))
Rewriting the file took ~0.6 seconds, and loading the DataFrame afterwards took ~0.7 seconds.
So, it appears you will be better off rewriting your data to disk first.
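Note that both snippets above hard-code a three-column pattern. If the column count comes from the meta part of the file, you could build the regex from it instead; a small sketch (n is assumed to hold that count):

import re

n = 3  # column count, taken from the meta part of the file
row_pattern = re.compile(r'\s+'.join([r'\S+'] * n))

line = '2008231.0 4891866.0 383842.0 2036693.0 4924388.0 375170.0'
print(row_pattern.findall(line))
# ['2008231.0 4891866.0 383842.0', '2036693.0 4924388.0 375170.0']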
Upvotes: 1
Reputation: 375695
I don't think there is a way to separate rows with the same delimiter as columns.
One way around this is to reshape (this will most likely be a copy rather than a view, to keep the data contiguous) after creating a Series using read_csv:
s = pd.read_csv(file_name, lineterminator=' ', header=None, squeeze=True)
df = pd.DataFrame(s.values.reshape(len(s)/n, n))
In your example:
In [1]: s = pd.read_csv('raw_data.csv', lineterminator=' ', header=None, squeeze=True)
In [2]: s
Out[2]:
0    2008231
1    4891866
2     383842
3    2036693
4    4924388
5     375170
Name: 0, dtype: float64
In [3]: pd.DataFrame(s.values.reshape(len(s)/3, 3))
Out[3]:
         0        1       2
0  2008231  4891866  383842
1  2036693  4924388  375170
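Note that reshape requires len(s) to be an exact multiple of n, and on Python 3 you would want integer division; a quick check along these lines doesn't hurt:

n = 3  # column count from the meta part of the file
assert len(s) % n == 0, 'data length is not a multiple of the column count'
df = pd.DataFrame(s.values.reshape(len(s) // n, n))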
Upvotes: 0