Jaya

Reputation: 1

sequence files from sqoop import

I have imported a table using sqoop and saved it as a sequence file.

How do I read this file into an RDD or Dataframe?

I have tried sc.sequenceFile(), but I'm not sure what to pass as keyClass and valueClass. I tried using org.apache.hadoop.io.Text and org.apache.hadoop.io.LongWritable for keyClass and valueClass, but it did not work. I am using PySpark to read the files.

Upvotes: 0

Views: 413

Answers (2)

Gara Walid

Reputation: 455

You can use the SeqDataSourceV2 package to read the sequence file with the DataFrame API without any prior knowledge of the schema (i.e. the keyClass and valueClass).
Please note that the current version is only compatible with Spark 2.4.

$ pyspark --jars seq-datasource-v2-0.2.0.jar
df = spark.read.format("seq").load("data.seq")
df.show()

Upvotes: 0

Karthik

Reputation: 1171

This does not work in Python, but it does work in Scala.

You need to do the following steps:

step1: When you import a table as a sequence file with Sqoop, a jar file is generated; you need to use it as the valueClass when reading the sequence file. This jar file is generally placed in the /tmp folder, but you can redirect it to a specific folder (i.e. a local folder, not HDFS) using the --bindir option.

example: sqoop import --connect jdbc:mysql://ms.itversity.com/retail_export --username retail_user --password itversity --table customers -m 1 --target-dir '/user/srikarthik/udemy/practice4/problem2/outputseq' --as-sequencefile --delete-target-dir --bindir /home/srikarthik/sqoopjars/

step2: You also need to download the sqoop-1.4.4-hadoop200.jar file from this link: http://www.java2s.com/Code/Jar/s/Downloadsqoop144hadoop200jar.htm

step3: Suppose the customers table was imported with Sqoop as a sequence file. Run spark-shell --jars path-to-customers.jar,sqoop-1.4.4-hadoop200.jar

example:

spark-shell --master yarn --jars /home/srikarthik/sqoopjars/customers.jar,/home/srikarthik/tejdata/kjar/sqoop-1.4.4-hadoop200.jar

step4: Now run the commands below inside the spark-shell:

scala> import org.apache.hadoop.io.LongWritable

scala> val data = sc.sequenceFile[LongWritable,customers]("/user/srikarthik/udemy/practice4/problem2/outputseq")

scala> data.map(tup => (tup._1.get(), tup._2.toString())).collect.foreach(println)

Upvotes: 0
