Reputation: 4575
I'm new to Python and Hive.
I was hoping I might get some advice.
Does anyone have any tips on how to turn a pandas DataFrame into a Hive table?
Upvotes: 3
Views: 7635
Reputation: 1
Based on Jose Antonio Martin H's answer... I could not find an easy way of doing this. I've been unable to get pandas DataFrame.to_sql() working with the Cloudera ODBC driver, so, as mine is a one-off case, I manually exported with DataFrame.to_csv() and ran the Hue/Hive importer tool on the file once it was on HDFS. Where Jose's answer helped me is in using a non-comma delimiter ("|" actually, rather than "," or "\t") and in turning the index off; both seemed to help the process. I could not get Parquet format to work, with or without compression, which I had thought to be the problem, and "load data local inpath" didn't work for me either.
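For what it's worth, a minimal sketch of that export step (the dataframe and output path here are made-up stand-ins):

import pandas as pd

# Stand-in dataframe; the real data came from elsewhere.
df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# Pipe-delimited export with the index suppressed; "|" is less likely
# than "," or "\t" to collide with characters inside text columns.
df.to_csv("/tmp/export.csv", sep="|", index=False)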
Just my experience, if it helps. If I get any of it working programmatically, I'll try to let you know here.
(BTW I can't comment yet, but hopefully sharing my own experience here helps others in a predicament.)
Upvotes: 0
Reputation: 1511
Your script should run on a machine where Hive can load data using LOAD DATA LOCAL INPATH. The steps (a combined sketch of steps 1-4 follows at the end of this answer):
1. Query the pandas dataframe to build a list of column name / datatype pairs.
2. Compose a valid HQL (DDL) CREATE TABLE statement using Python string operations (basically concatenation).
3. Issue the CREATE TABLE statement in Hive.
4. Write the pandas dataframe as a CSV separated by "\t", turning headers and index off (check the parameters of to_csv()).
5. From your Python script, call hive -e in a system console.
For instance:

import subprocess

# Call the Hive CLI and execute the statements in str_command_list.
p = subprocess.Popen(['hive', '-e', str_command_list],
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = p.communicate()
This calls the Hive console and executes, for instance, LOAD DATA LOCAL INPATH, inserting your CSV data into the created table.
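Putting steps 1-4 together, a rough sketch (the table name, file path, and the dtype-to-Hive-type mapping are illustrative assumptions, not tested against any particular Hive version):

import pandas as pd

# Stand-in dataframe; replace with your real data.
df = pd.DataFrame({"id": [1, 2], "price": [1.5, 2.5], "name": ["a", "b"]})

# 1. Map pandas dtypes to Hive column types (rough mapping; extend as needed).
hive_types = {"int64": "BIGINT", "float64": "DOUBLE",
              "object": "STRING", "bool": "BOOLEAN"}
columns = ", ".join("`%s` %s" % (col, hive_types.get(str(dtype), "STRING"))
                    for col, dtype in df.dtypes.items())

# 2. Compose the DDL by plain string concatenation.
ddl = ("CREATE TABLE IF NOT EXISTS my_table (%s) "
       "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'" % columns)

# 4. Write the dataframe as a tab-separated file, headers and index off.
csv_path = "/tmp/my_table.csv"
df.to_csv(csv_path, sep="\t", header=False, index=False)

# 3 + 5. Both statements become the str_command_list passed to hive -e above.
str_command_list = ("%s; LOAD DATA LOCAL INPATH '%s' INTO TABLE my_table;"
                    % (ddl, csv_path))

Running the Popen call above with this str_command_list creates the table and loads the file in one hive invocation.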
Then you are happy.
Upvotes: 1