Geospatial Enterprise
Geospatial Enterprise

Reputation: 13

Load csv into pandas dataframe from Pydrill Query

I am able to load a csv into pandas dataframe, but it is stuck in a list. How can I load directly into a pandas dataframe from Pydrill or unlist the pandas dataframe columns and data? I've tried unlisting and it puts everything into a list of a list.

I've used the to_dataframe(), but can't seem to find documentation on if I can use a delimeter. pd.dataframe doesn't work because of the Pydrill query.

reviews = drill.query("SELECT * FROM  hdfs.datasets.`titanic_ML/titanic.csv` LIMIT 1000", timeout=30)
print(reviews)
import pandas as pd
df2 = reviews.to_dataframe()
df2.rename(columns=df2.iloc[0])
headers = df2.iloc[0]
print(headers)

new_df = pd.DataFrame(df2.values[1:], columns=headers)
new_df.head()

The results cast everything into a list.

["pclass","sex","age","sibsp","parch","fare","embarked","survived"]

0   ["3","1","38.0","0","0","7.8958","1","0"]
1   ["1","1","42.0","0","0","26.55","1","0"]
2   ["3","0","9.0","4","2","31.275","1","0"]
3   ["3","1","27.0","0","0","7.25","1","0"]
4   ["1","1","41.0","0","0","26.55","1","0"]

I'd like to get everything into a normal pandas dataframe.

Upvotes: 0

Views: 353

Answers (2)

lochnessbobster
lochnessbobster

Reputation: 95

Try using Table Functions as described in O’Reily Text: Chapter 4. Querying Delimited Data. This will delimit the file and apply the first row to your columns. Note: because everything is being read as text, you may need to cast your values as floats if you want to do arithmetic in your select or where.

This should get you what you want:

sql="""
SELECT * 
FROM table(hdfs.datasets.`/titanic_ML/titanic.csv`(
    type => 'text', 
    extractHeader => true, 
    fieldDelimiter => ',')
) LIMIT 1000
"""

rows = drill.query(sql, timeout=30)

df = rows.to_dataframe()
df.head()

Upvotes: 0

Geospatial Enterprise
Geospatial Enterprise

Reputation: 13

The solution I found was this:

it doesn't unlist the dataframe, but it's an alternate solution to the problem.

connect_str = "dbname='dbname' user='dsa_ro_user' conn = psycopg2.connect(connect_str) host='host database'

SQL = "SELECT * " SQL += " FROM train"

df = pd.read_sql(SQL,conn) df.head()

Upvotes: 0

Related Questions