zbinsd

Reputation: 4214

How do I open a gzip file in Google Datalab?

I have a bucket that contains a file.csv.gz. It's around 210MB and I'd like to read it into pandas. Anyone know how to do that?

For a non-gz, this works:

%gcs read --object gs://[bucket-name]/[path/to/file.csv] --variable csv

import pandas as pd
from io import StringIO

# Store in a pandas dataframe
df = pd.read_csv(StringIO(csv))

Upvotes: 0

Views: 379

Answers (2)

Xiaoxia Lin

Reputation: 746

You can still use pandas.read_csv, but you have to specify compression='gzip' and import StringIO from pandas.compat.

I tried the code below with a small file in my Datalab, and it worked for me.

%gcs read --object gs://[bucket-name]/[path/to/file.csv.gz] --variable my_file

import pandas as pd
from pandas.compat import StringIO

# The object is gzip-compressed, so tell pandas to decompress it
df = pd.read_csv(StringIO(my_file), compression='gzip')
df
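
If the %gcs magic hands the content back as raw bytes (as it may on a Python 3 kernel), io.BytesIO is the safer wrapper. A minimal sketch, reusing the my_file variable from the cell above:

import io
import pandas as pd

# Assumes my_file holds the raw gzip bytes read by the %gcs cell above.
df = pd.read_csv(io.BytesIO(my_file), compression='gzip')
df.head()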

Upvotes: 1

Bradley Jiang

Reputation: 424

"%%gcs read" command does not work with compressed data.

"%%gcs read" load all the content as a string. Since your compressed size is already 210MB, it might not be a good idea to read it all as a string anyway.

In your case, you might consider the BigQuery commands instead. "%%bq" supports compressed CSV (.gz format only) as a data source. The following shows how to do it:

Cell 1 -- Define the data source:

%%bq datasource --name mycsv --path gs://b/o.csv.gz --compressed --format csv
schema:
  - name: url
    type: STRING
  - name: label
    type: STRING

Cell 2 -- Define the query:

%%bq query --datasources mycsv --name myquery
SELECT * FROM mycsv

Cell 3 -- Run the query and save the result to a DataFrame:

df = %%bq execute --query myquery --to-dataframe

In Cell 2, you probably want to add some filters and select only the columns you need. Otherwise you are loading the whole file into memory, which might be too large.
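
For instance, a filtered variant of Cell 2 (url and label come from the schema in Cell 1; the predicate value is hypothetical):

%%bq query --datasources mycsv --name myquery
SELECT url
FROM mycsv
WHERE label = 'some_label'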

Note that the commands above invoke BigQuery operations, so they require the BigQuery API to be enabled in your project and may also incur some costs.

Upvotes: 0
