Reputation: 13
I am able to load a csv into pandas dataframe, but it is stuck in a list. How can I load directly into a pandas dataframe from Pydrill or unlist the pandas dataframe columns and data? I've tried unlisting and it puts everything into a list of a list.
I've used the to_dataframe(), but can't seem to find documentation on if I can use a delimeter. pd.dataframe doesn't work because of the Pydrill query.
reviews = drill.query("SELECT * FROM hdfs.datasets.`titanic_ML/titanic.csv` LIMIT 1000", timeout=30)
print(reviews)
import pandas as pd
df2 = reviews.to_dataframe()
df2.rename(columns=df2.iloc[0])
headers = df2.iloc[0]
print(headers)
new_df = pd.DataFrame(df2.values[1:], columns=headers)
new_df.head()
The results cast everything into a list.
["pclass","sex","age","sibsp","parch","fare","embarked","survived"]
0 ["3","1","38.0","0","0","7.8958","1","0"]
1 ["1","1","42.0","0","0","26.55","1","0"]
2 ["3","0","9.0","4","2","31.275","1","0"]
3 ["3","1","27.0","0","0","7.25","1","0"]
4 ["1","1","41.0","0","0","26.55","1","0"]
I'd like to get everything into a normal pandas dataframe.
Upvotes: 0
Views: 353
Reputation: 95
Try using Table Functions as described in O’Reily Text: Chapter 4. Querying Delimited Data. This will delimit the file and apply the first row to your columns. Note: because everything is being read as text, you may need to cast
your values as floats if you want to do arithmetic in your select
or where
.
This should get you what you want:
sql="""
SELECT *
FROM table(hdfs.datasets.`/titanic_ML/titanic.csv`(
type => 'text',
extractHeader => true,
fieldDelimiter => ',')
) LIMIT 1000
"""
rows = drill.query(sql, timeout=30)
df = rows.to_dataframe()
df.head()
Upvotes: 0
Reputation: 13
The solution I found was this:
it doesn't unlist the dataframe, but it's an alternate solution to the problem.
connect_str = "dbname='dbname' user='dsa_ro_user' conn = psycopg2.connect(connect_str) host='host database'
SQL = "SELECT * " SQL += " FROM train"
df = pd.read_sql(SQL,conn) df.head()
Upvotes: 0