Reputation: 912
I usually find questions and discussions about loading datasets with several million rows into Python using Dask or Pandas chunksize, but my problem is a bit different: I have millions of columns/features and only a few thousand records. I found that loading such a dataset from CSV is absurdly slow and consumes a large amount of memory. I have done some benchmarks, and sometimes pandas is even faster than Dask!
I tested the case in which I have 1 million rows and 300 columns; I can load it into memory easily. But if I have 300 rows and 1 million columns, pandas consumes all 64 GB of RAM and dies.
How can I handle such a dataset?
Thank you very much.
Upvotes: 0
Views: 3152
Reputation: 16673
This was my idea in the comments if you were to use pandas. It is untested, but you could read the columns in chunks by building the usecols argument dynamically. I said iloc in the comments, but that would still require reading the entire file first, so what I meant was usecols. You can just adjust i and the number in the range.
import pandas as pd

# f is the path to the wide CSV file
i = 10
chunks = []
for _ in range(1, 4):
    cols = list(range(i - 10, i))
    print(cols)
    # Read only this block of columns and transpose it so the columns become rows
    chunks.append(pd.read_csv(f, usecols=cols).T)
    i += 10

# Stack the transposed blocks, then transpose back to the original orientation
df = pd.concat(chunks).T
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
As you can see, you can essentially "chunk" by columns using this technique.
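To scale the same idea to the full width of the file, a sketch along these lines could work (the helper name, chunk size, and total column count below are placeholders, not something tested against your data): it reads a block of columns at a time via usecols and concatenates the transposed blocks.

import pandas as pd

def read_wide_csv(path, n_cols, block=10_000):
    # Hypothetical helper: read `block` columns at a time,
    # transposing each block so the original columns become rows.
    parts = []
    for start in range(0, n_cols, block):
        cols = list(range(start, min(start + block, n_cols)))
        parts.append(pd.read_csv(path, usecols=cols).T)
    # Transpose back so the result has the original orientation.
    return pd.concat(parts).T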
Upvotes: 2
Reputation: 25190
I tested the case in which I have 1 million rows and 300 columns; I can load it into memory easily. But if I have 300 rows and 1 million columns, pandas consumes all 64 GB of RAM and dies.
How can I handle such a dataset?
Pandas is built on the assumption that your data is "tall" (many more rows than columns) and that each column holds values of a single type. If those assumptions don't hold, then Pandas will be very inefficient.
You can read these slides to understand how Pandas works internally. Essentially, Pandas specializes each column by type, picking the most restrictive type for that column, which allows Pandas to use less memory for each cell. However, if you have few rows and many columns, then the per-column overhead will be very large.
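As a rough way to see this for yourself, you can time read_csv on two CSV files that hold exactly the same cells in the two orientations. The file names and sizes below are arbitrary and the exact numbers will depend on your machine, but the wide file should be noticeably slower to parse:

import time
import numpy as np
import pandas as pd

# Same 3 million cells, written out in "wide" and "tall" orientation.
values = np.random.default_rng(0).random((300, 10_000))
pd.DataFrame(values).to_csv("wide.csv", index=False)
pd.DataFrame(values.T).to_csv("tall.csv", index=False)

for name in ("tall.csv", "wide.csv"):
    start = time.perf_counter()
    pd.read_csv(name)
    print(name, round(time.perf_counter() - start, 2), "seconds")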
I would suggest that you avoid formatting your data like this. Here is some Python code which can transpose a CSV file without using Pandas:
import csv

def transpose():
    # Pair up the values column-wise with zip(*...) and write the result back out.
    with open("test.csv", "rt", newline="") as src, open("transposed.csv", "wt", newline="") as dst:
        csv.writer(dst).writerows(zip(*csv.reader(src)))

transpose()
(On my computer, this takes 300MB for a 100K by 300 CSV file. Compare that to 900MB for opening it in read_csv normally.)
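Once the file has been transposed on disk, it has the "tall" shape Pandas expects, so loading it the normal way is cheap. A minimal sketch, assuming the transposed.csv produced above:

import pandas as pd

# The transposed file now has a few hundred columns and many rows,
# which is the orientation Pandas handles efficiently.
df = pd.read_csv("transposed.csv")
print(df.shape)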
Upvotes: 3