Reputation: 1209
I'm reading data from an Excel sheet using pandas and need to insert that data into Hive using PySpark.
import pandas as pd
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

sparkConf = SparkConf().setAppName("App")
sc = SparkContext(conf=sparkConf)
sqlContext = HiveContext(sc)

excel_file = pd.ExcelFile("export_n_moreExportData10846.xls")
for sheet_name in excel_file.sheet_names:
    try:
        df = pd.read_excel(excel_file, header=None, squeeze=True, sheet_name=sheet_name)
        # Use the first fully populated row as the header; everything below it is data
        for i, row in df.iterrows():
            if row.notnull().all():
                data = df.iloc[(i + 1):].reset_index(drop=True)
                data.columns = list(df.iloc[i])
                break
        # Convert columns to numeric where possible
        for c in data.columns:
            data[c] = pd.to_numeric(data[c], errors='ignore')
        print(data)  # I need to insert this data into Hive
    except Exception:
        continue
Upvotes: 1
Views: 2193
Reputation: 186
You can check out the HadoopOffice library, which offers Excel reading/writing with many features on the major Big Data platforms (MapReduce, Hive, Flink, Spark, ...): https://github.com/ZuInnoTe/hadoopoffice/wiki
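For example, the library ships a Spark data source that loads a workbook directly into a DataFrame, which you can then write to Hive. This is only a minimal sketch, assuming the format name and the "read.locale.bcp47" option match the version documented on the wiki; check your release before relying on them:

# Sketch: read an Excel file through the HadoopOffice Spark data source,
# then persist it as a Hive table. The format string and the locale option
# are taken from the project's documentation and may differ across versions.
df_excel = sqlContext.read \
    .format("org.zuinnote.spark.office.excel") \
    .option("read.locale.bcp47", "us") \
    .load("export_n_moreExportData10846.xls")

df_excel.write.mode('overwrite').saveAsTable('your_table_name')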
Upvotes: 0
Reputation: 21766
You can save your Pandas dataframe using the following code, provided the column types are compatible with Spark:
tablename = 'your_table_name'
df_spark = sqlContext.createDataFrame(data)

# Remove spaces from the column names (they are not valid in Hive table columns)
columns_with_spaces = filter(lambda x: ' ' in x, df_spark.columns)
for column in columns_with_spaces:
    old_column = column
    new_column = column.replace(' ', '_')
    df_spark = df_spark.withColumnRenamed(old_column, new_column)

# Save to Hive
df_spark.write.mode('overwrite').saveAsTable(tablename)
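To confirm the write succeeded, you can read the table back through the same HiveContext (the table name below is just the placeholder used above):

# Query the freshly written Hive table
sqlContext.sql("SELECT * FROM your_table_name LIMIT 5").show()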
Upvotes: 1