Reputation: 4214
I have a bucket that contains a file.csv.gz. It's around 210MB and I'd like to read it into pandas.
Anyone know how to do that?
For a non-gz, this works:
%gcs read --object gs://[bucket-name]/[path/to/file.csv] --variable csv
# Store in a pandas dataframe
df = pd.read_csv(StringIO(csv))
Upvotes: 0
Views: 379
Reputation: 746
You can still use pandas.read_csv, but you have to specify compression='gzip' and import StringIO from pandas.compat.
I tried the code below with a small file in my Datalab, and it worked for me.
%gcs read --object gs://[bucket-name]/[path/to/file.csv.gz] --variable my_file
import pandas as pd
from pandas.compat import StringIO
# my_file holds the object content read from GCS by the %gcs magic
df = pd.read_csv(StringIO(my_file), compression='gzip')
df
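If %gcs read hands the object back as raw bytes rather than text (which can happen with a binary .gz object), it may be safer to decompress explicitly before passing the data to pandas. A minimal sketch, assuming the gzip bytes end up in a variable named my_file:
import gzip
import pandas as pd
from io import BytesIO
# my_file is assumed to hold the raw gzip bytes returned by %gcs read
with gzip.GzipFile(fileobj=BytesIO(my_file)) as gz:
    df = pd.read_csv(gz)
df.head()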
Upvotes: 1
Reputation: 424
"%%gcs read" command does not work with compressed data.
"%%gcs read" load all the content as a string. Since your compressed size is already 210MB, it might not be a good idea to read it all as a string anyway.
In your case, maybe you can consider the BigQuery commands instead. "%%bq" supports compressed CSV (only the .gz format) as a data source. The following shows how to do it:
Cell 1 -- Define the data source:
%%bq datasource --name mycsv --path gs://b/o.csv.gz --compressed --format csv
schema:
  - name: url
    type: STRING
  - name: label
    type: STRING
Cell 2 -- Define the query:
%%bq query --datasources mycsv --name myquery
SELECT * FROM mycsv
Cell 3 -- Run the query and save the result to a DataFrame:
df = %%bq execute --query myquery --to-dataframe
In cell 2, you probably want to add some filters and select only the columns you want. Otherwise you are loading the whole file into memory, which might be too large.
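For example, a filtered version of Cell 2 might look like the sketch below. The url and label columns come from the schema defined in Cell 1; the WHERE value and the LIMIT are just placeholders for whatever filter you actually need:
%%bq query --datasources mycsv --name myquery
SELECT url
FROM mycsv
WHERE label = 'some_label'  -- hypothetical filter value
LIMIT 100000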
Note that the commands above invoke BigQuery operations, so they require the BigQuery API to be enabled in your project and may also incur some costs.
Upvotes: 0