Reputation: 2742
Given a tab-delimited file in the following format, I need to perform operations in a column-wise manner. The easiest solution that I know of is to use the pandas package; however, my file being a couple of gigs, I am having difficulty reading it. So I decided to use a normal streaming method.
Here, I want to generate column variables dynamically (the number of columns is ~500), each of which will store all the values in that particular column, including blanks. For example, variable a will be the list [11,22,31,,42,555].
a    b  c  d  e  f
11   9  9  1  6
22   8  0  8  2
31   7  7  3  5
     6  1  6  4
42   6  2  5  2  3
555  3  4  1  2
How can I generate these variables by reading the entire file just once?
Upvotes: 1
Views: 4304
Reputation: 365747
From a comment:
It can't load the whole file into memory and it seems like it's loading forever.
If this is your problem, you're attacking it wrong. If you don't have enough memory to hold the whole file at once, you probably don't have enough to hold the whole array at once either. You definitely won't if you're using pure-Python objects, and probably not even with pandas or numpy.
So, what you need to do is convert it on the fly from one disk-based format to another, and then rewrite your code to process the data off the disk much more quickly than the original file.
One obvious possibility is a file that you can random-access seek or mmap through. For example, let's say you've got 500 columns full of values that all fit within range(0, 1<<32), and you don't know how many rows you have. Iterate the csv once just to count the rows, then create a file of 4 * 500 * rows bytes, then iterate the csv again to fill it out using the struct module. Then you can access any column by just creating an mmap view over the byte range 4 * rows * col to 4 * rows * (col+1).
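Here's a minimal sketch of that single-file, column-major layout. The file names, the get_column_values helper, and treating blank cells as 0 are all illustrative choices, not part of the original answer:
import csv
import mmap
import struct

# Pass 1: count the data rows (total lines minus the header).
with open('input.csv', 'rb') as f:
    rows = sum(1 for _ in f) - 1

# Pass 2: write the values column-major, so each column is contiguous on disk.
with open('input.csv', 'rb') as f, open('columns.dat', 'wb') as out:
    reader = csv.reader(f, delimiter='\t')
    ncols = len(next(reader))        # header row gives the column count
    out.truncate(4 * ncols * rows)   # pre-size the file with zeros
    for r, row in enumerate(reader):
        for c, cell in enumerate(row):
            out.seek(4 * (c * rows + r))  # column-major offset of cell (r, c)
            out.write(struct.pack('!I', int(cell) if cell else 0))

# Any column is now the contiguous byte range [4*rows*col, 4*rows*(col+1)).
def get_column_values(col):
    with open('columns.dat', 'rb') as f:
        m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        return struct.unpack('!{}I'.format(rows),
                             m[4 * rows * col:4 * rows * (col + 1)])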
Or, alternatively, transpose on the fly. This will be simpler, and have half the I/O, but may be slower (because of both memory and disk cache effects).
If you only need by-column random access, not by-row, just create a file for each column; then, to handle any column, just mmap the appropriate file, or read it via struct or array, or just use pandas or numpy to read/write each one (or possibly just pickle a pandas table/numpy array directly).
Another possibility is to use a key-value database and store big blobs in it, equivalent to the files. (In fact, if you're planning on pickled tables/arrays, a shelve.Shelf is a key-value database where the values are persisted by pickling.)
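For instance, a minimal sketch of the shelve variant, where each column is one pickled value under its name (the filename and the blanks-as-None choice are just illustrative):
import shelve

# Store each column list under its column name; 'columns.shelf' is a
# placeholder filename.
db = shelve.open('columns.shelf')
db['a'] = [11, 22, 31, None, 42, 555]  # a blank cell could be stored as None
db.close()

# Later, reopen the shelf and pull a column back out by name.
db = shelve.open('columns.shelf')
print db['a']
db.close()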
To pick from those ideas arbitrarily, here's how you'd create separate files, transpose on the fly, and store the data using struct:
import csv
import struct

# You may want to check whether the column_foo.dat files already exist and are
# at least as new as input.csv, so you don't re-generate them unless the input
# data changes. That's obviously only a worthwhile optimization if you run the
# script multiple times on the exact same input.
with open('input.csv', 'rb') as f:
    reader = csv.reader(f, delimiter='\t')
    column_names = next(reader)
    column_files = [open('column_{}.dat'.format(column_name), 'wb')
                    for column_name in column_names]
    for row in reader:
        for i, col in enumerate(row):
            value = int(col) if col else 0  # store blank cells as 0 so every slot stays 4 bytes
            column_files[i].write(struct.pack('!I', value))
    for column_file in column_files:
        column_file.close()
Now, to access a column later:
import os
import struct

def get_column(name):
    filename = 'column_{}.dat'.format(name)
    with open(filename, 'rb') as f:
        fsize = os.path.getsize(filename)
        length = fsize // 4  # each value is a 4-byte unsigned int
        fmt = '!{}I'.format(length)
        return struct.unpack(fmt, f.read())
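For example, given the question's sample data and the blanks-as-zero choice made when writing the files:
a = get_column('a')  # -> (11, 22, 31, 0, 42, 555), the blank written as 0 above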
Upvotes: 1
Reputation: 3
def readAsColumnVec(fileName):
    reader = open(fileName, 'r')
    text = reader.read()
    reader.close()
    lines = text.split('\n')  # Split the text into lines
    matrix = []
    for i in range(1, len(lines)):  # Skip the header line
        if len(lines[i]) > 0:  # Ignore the trailing empty line
            line = lines[i].split('\t')  # The delimiter depends on your file format
            matrix.append(line)
    columnMatrix = []
    columns = len(matrix[0])
    rows = len(matrix)
    for j in range(columns):  # Transpose the matrix
        column = []
        for i in range(rows):
            column.append(matrix[i][j])
        columnMatrix.append(column)
    return columnMatrix

if __name__ == "__main__":
    print readAsColumnVec('test.txt')
Upvotes: -1
Reputation: 17849
Your best bet, if you really don't want to use pandas, is numpy, because numpy has a transpose operation, so you can flip the array so the columns become the index!
import csv
import numpy as np

with open('file_name.csv', 'rb') as f:
    reading = csv.reader(f, delimiter='\t')
    columns = reading.next()  # the header row
    hold_files = []
    for row in reading:
        hold_files.append(row)
data = np.array(hold_files)
data = data.T  # transpose, so each row of data is one original column
now data is an array where each item is an array containing each item that was in each column! Still costly, and pandas is still better for all of this, but this does answer your question!
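For instance, to pull one column back out by name (using the columns header row captured above):
a_col = data[columns.index('a')]  # every value from column 'a', as strings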
Also, I suggest using pickle to save it to disk! It will take more time, but you won't have to iterate through it again! Here is how you would do that:
import pickle

def pickleSave(data, pickleFile):
    output = open(pickleFile, 'wb')
    pickle.dump(data, output)
    output.close()
    print "file has been saved to %s" % (pickleFile)

def pickleLoad(pickleFile):
    pkl_file = open(pickleFile, 'rb')
    data = pickle.load(pkl_file)
    pkl_file.close()
    return data
this will save whatever is in data (an array, dataframe, etc.) to whatever filename you want: pickleFile
the following will save the array I just made!
file_name = "my_array.txt"
pickleSave(data,file_name)
now whenever you want to use that array, you have it saved and can load it with the following:
loaded_data = pickleLoad("my_array.txt")
now the variable loaded_data contains the array I previously created! You can use this on any type of file!
Upvotes: 1
Reputation: 365747
You've really got two questions in one here.
The easiest solution that I know of is to use the pandas package; however, my file being a couple of gigs, I am having difficulty reading it. So I decided to use a normal streaming method.
Don't do that. Pandas can handle huge amounts of data with a lot less memory and time than anything you write in pure Python, and less verbosely as well. It's conceivable that dropping down to numpy and then pandasizing the results after reading will help, but dropping down to pure Python is guaranteed to make things worse.
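For instance, two standard pandas tricks for files too big to read naively (the filename, column name, and chunk size here are illustrative, and process() is a placeholder for your own logic):
import pandas as pd

# Read only the columns you actually need...
a = pd.read_csv('input.csv', sep='\t', usecols=['a'])

# ...or stream the file in fixed-size chunks instead of loading it whole.
for chunk in pd.read_csv('input.csv', sep='\t', chunksize=100000):
    process(chunk)  # placeholder for your own per-chunk logic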
If you have a problem with your code, show us your code (and enough information to generate equivalent sample data).
Here, I want to generate column variables dynamically (the number of columns is ~500), each of which will store all the values in that particular column, including blanks. For ex: variable a will be a list [11,22,31,,42,555].
Definitely don't do that. You almost never want to generate variables dynamically like this. There is no benefit to it, and there are plenty of downsides.
This is exactly what dictionaries are for. Instead of this:
for column in columns:
    create_variable_named(column[0]).set_to(column[1:])
Just do this:
column_dict = {column[0]: column[1:] for column in columns}
Then you can access it like this:
column_dict['a']
And, if you want to iterate over all your columns, instead of this:
for col in a, b, c, d, …, zz, aaa, …:
You can just do this:
for col in column_dict.values():
If you really, really want to create dynamic variables, you can. Depending on what scope or namespace you want to add them to, it's one of the following:
globals()[name] = value
locals()[name] = value
setattr(module, name, value)
setattr(obj, name, value)
setattr(cls, name, value)
… and so on.
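If you insist on trying it, here's a toy illustration of the globals() variant (the sample data here is made up to stand in for your parsed columns):
# Toy illustration only; dynamic variables are still a bad idea.
columns = [['a', 11, 22, 31], ['b', 9, 8, 7]]
for column in columns:
    globals()[column[0]] = column[1:]
print a  # [11, 22, 31]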
Upvotes: 2
Reputation: 298196
You can read it with the csv module and a generator:
import csv

def get_column(name):
    with open('filename', 'r') as handle:
        reader = csv.DictReader(handle, delimiter='\t')
        for row in reader:
            yield row[name]
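For example, to realize one column as a list (assuming the header names from the question):
a = list(get_column('a'))  # every value from column 'a', blanks included as ''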
But Pandas is probably going to be faster once you start doing something with this data.
Upvotes: 0
Reputation: 1212
Try a dictionary of lists:
with open('myfile.txt', 'r') as infile:
    header = infile.readline()
    col_values = {var: [] for var in header.split()}
    var_idx_dict = {idx: var for idx, var in enumerate(header.split())}
    for line in infile:
        values = line.rstrip('\n').split('\t')
        for idx, value in enumerate(values):
            var = var_idx_dict[idx]
            col_values[var].append(value)
col_values should contain string keys of your column headers, with each key pointing to a list of all the values in that column.
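For example, with the question's sample data you'd expect something like:
print col_values['a']  # ['11', '22', '31', '', '42', '555'] -- still strings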
EDIT: given that your file is very large, that I don't know your environment, and that you're having trouble with pandas, consider giving PyTables a try. I can provide some assistance with that... it depends on your end goal (basic math? statistics? data formatting?).
Upvotes: 0