ERJAN

Reputation: 24506

How can I open and process a super-heavy 800 PB CSV file?

How can I open a file that is 800 petabytes?

It's a file for a data science competition: 807,167,556,410,028 kB ≈ 807,168 TB ≈ 800 PB.

It's compressed into a 600 MB archive, but I can't unzip it because of its reported size. Is it possible to read the first 1000 rows from the zipped archive with pandas?


Upvotes: 1

Views: 97

Answers (1)

ERJAN

Reputation: 24506

import zipfile
import pandas as pd

# Open the CSV inside the archive without extracting it to disk first.
archive = zipfile.ZipFile('bigfile.zip')
file = archive.open('big.csv')

# With chunksize, read_csv returns an iterator instead of loading the whole file.
textfilereader = pd.read_csv(file, chunksize=1000000)
df = textfilereader.get_chunk()  # df now holds the first chunk as a DataFrame

This is a partial answer, since it only reads the first chunksize rows.

P.S. I tested it with 3 million rows, and it fails with a memory error.
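If a single chunk is still too large for memory, a smaller chunksize combined with iterating over the chunks (rather than materializing one big one) keeps memory usage bounded. A minimal sketch, reusing the `bigfile.zip` / `big.csv` names from the snippet above (those names are assumptions from this answer, not a real dataset):

```python
import zipfile
import pandas as pd

def process_zipped_csv(zip_path, member, chunksize=100_000):
    """Stream a CSV inside a zip archive and process it chunk by chunk."""
    total_rows = 0
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as f:
            for chunk in pd.read_csv(f, chunksize=chunksize):
                total_rows += len(chunk)  # replace with real per-chunk work
    return total_rows
```

Each `chunk` is an ordinary DataFrame of at most chunksize rows, so only one chunk lives in memory at a time.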

P.P.S. It was a bug in my WinRAR archive program! I installed 7-Zip and it shows the file is only 5 GB! Good lesson to learn: sometimes it's the program, not the dataset!
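Rather than trusting the archiver's size display, the zip metadata itself records each member's uncompressed size, which Python can read directly. A small sketch (the archive path is whatever zip you downloaded):

```python
import zipfile

def uncompressed_sizes(zip_path):
    """Return {member_name: uncompressed size in bytes} from the zip's metadata."""
    with zipfile.ZipFile(zip_path) as archive:
        return {info.filename: info.file_size for info in archive.infolist()}
```

For example, `uncompressed_sizes('bigfile.zip')` reports the true size of `big.csv` without extracting anything, which is a quick sanity check before concluding a dataset really is hundreds of petabytes.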

Upvotes: 1
