Reputation: 11
How do I read a large table from HDFS in a Jupyter notebook as a pandas DataFrame? The script is launched through a Docker image.
Libraries:
from impala.dbapi import connect
from impala.util import as_pandas

# Kerberos-authenticated connection to Impala over SSL
impala_conn = connect(host='hostname', port=21050,
                      auth_mechanism='GSSAPI',
                      timeout=100000, use_ssl=True, ca_cert=None,
                      ldap_user=None, ldap_password=None,
                      kerberos_service_name='impala')
This works:
import pandas as pd
df = pd.read_sql("select id, crt_mnemo from demo_db.stg_deals_opn LIMIT 100", impala_conn)
print(df)
This does not work: the operation hangs and gives no errors.
import pandas as pd
df = pd.read_sql("select id, crt_mnemo from demo_db.stg_deals_opn LIMIT 1000", impala_conn)
print(df)
Upvotes: 1
Views: 686
Reputation: 2239
This seems to be a problem with the number of rows you can move from Impala with pandas' read_sql function. I have the same problem, but with lower limits than yours. You may want to contact the database admin to check the size limits. Here are some other options: https://docs.cloudera.com/machine-learning/cloud/import-data/topics/ml-running-queries-on-impala-tables.html
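For what it's worth, the Cloudera page above pulls rows through an impyla cursor rather than pd.read_sql. Below is a minimal sketch of that route, reusing the impala_conn and the query from the question; the batch size of 100 in the second variant is an arbitrary choice, not something from either post:

from impala.util import as_pandas
import pandas as pd

# Cursor-based fetch: execute the query and let as_pandas build the DataFrame.
cursor = impala_conn.cursor()
cursor.execute("select id, crt_mnemo from demo_db.stg_deals_opn limit 1000")
df = as_pandas(cursor)

# Batched variant: pull fixed-size chunks so memory stays bounded and a
# stalled transfer is easier to localize.
cursor = impala_conn.cursor()
cursor.execute("select id, crt_mnemo from demo_db.stg_deals_opn limit 1000")
columns = [col[0] for col in cursor.description]
frames = []
while True:
    rows = cursor.fetchmany(100)  # arbitrary batch size
    if not rows:
        break
    frames.append(pd.DataFrame(rows, columns=columns))
df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame(columns=columns)

pd.read_sql itself also accepts a chunksize argument, which makes it return an iterator of smaller DataFrames instead of one large one; concatenating those with pd.concat is another way to avoid a single oversized fetch.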
Upvotes: 0