How to run LOAD DATA INPATH hive command with wildcard from spark?

Question

I am creating a dataframe as below:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Reuses the running session if there is one (e.g. in the pyspark shell)
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Explicit schema for the two CSV columns
schma = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])
empdf = spark.read.csv("/home/hdfs/sparkwork/hiveproj/Datasets/empinfo/emp.csv", schema=schma)
empdf.show()

I then save the DataFrame in Parquet format (Spark writes a directory of part files):

empdf.write.parquet(path="/home/hdfs/sparkwork/hiveproj/Data/empinfo/empl_par/")

If I use a specific file name in the LOAD DATA INPATH command, it works fine:

spark.sql("LOAD DATA INPATH '/home/hdfs/sparkwork/hiveproj/Data/empinfo/empl_par/part-00000-6cdfcba5-49ab-499c-8d7f-831c9ec314de-c000.snappy.parquet' INTO TABLE EMPINFO.EMPLOYEE")

But if I use a wildcard (* or *.parquet) instead of the file name, I get an error:

spark.sql("LOAD DATA INPATH '/home/hdfs/sparkwork/hiveproj/Data/empinfo/empl_par/*.parquet' INTO TABLE EMPINFO.EMPLOYEE")

Is there a way to load all the contents of a folder into the table using a wildcard in the Hive command from Spark?

s.polam · Accepted Answer

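Instead of

spark.sql("LOAD DATA INPATH '/home/hdfs/sparkwork/hiveproj/Data/empinfo/empl_par/*.parquet' INTO TABLE EMPINFO.EMPLOYEE")

try writing the DataFrame into the table directly:

empdf.write.partitionBy("year","month","day").insertInto("EMPINFO.EMPLOYEE")

Note: I have used year, month and day as partition columns; change them to match your table, and make sure those columns exist in the DataFrame. If the table is not partitioned, or Spark reports that insertInto() can't be used together with partitionBy(), drop the partitionBy() call, since insertInto() takes the partitioning from the table definition.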
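If you would rather keep LOAD DATA INPATH, one possible workaround builds on the fact that a single-file path works in the question: expand the wildcard yourself and issue one statement per matching file. The sketch below is only an illustration. The helper name build_load_statements and the file names in listing are hypothetical, and it assumes you already have the directory listing from whatever tool your environment provides.

```python
from fnmatch import fnmatchcase
from posixpath import basename


def build_load_statements(file_paths, pattern, table):
    """Return one LOAD DATA INPATH statement per path whose file name matches pattern."""
    statements = []
    for path in sorted(file_paths):
        # Match on the file name only, so "*.parquet" skips markers such as _SUCCESS.
        if not fnmatchcase(basename(path), pattern):
            continue
        # Refuse paths that would break out of the quoted SQL literal.
        if "'" in path:
            raise ValueError(f"path contains a single quote: {path!r}")
        statements.append(f"LOAD DATA INPATH '{path}' INTO TABLE {table}")
    return statements


# Hypothetical listing of the output directory; real part-file names will differ.
listing = [
    "/home/hdfs/sparkwork/hiveproj/Data/empinfo/empl_par/_SUCCESS",
    "/home/hdfs/sparkwork/hiveproj/Data/empinfo/empl_par/part-00000-example.snappy.parquet",
    "/home/hdfs/sparkwork/hiveproj/Data/empinfo/empl_par/part-00001-example.snappy.parquet",
]
statements = build_load_statements(listing, "*.parquet", "EMPINFO.EMPLOYEE")
for statement in statements:
    print(statement)  # with a live session, run spark.sql(statement) instead
```

Each generated statement has the same shape as the single-file command that already works, so the wildcard never reaches Spark. Keep in mind that LOAD DATA INPATH moves the source files, so the directory listing should be taken once, before the first statement runs.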
