Arun

Reputation: 65

Spark 2.1.0 : Reading compressed csv file

I'm trying to read a compressed CSV file (.bz2) as a DataFrame. My code is as follows:

    // read the CSV data (Spark decompresses .bz2 input transparently)
    Dataset<Row> rData = spark.read().option("header", true).csv(input);

This works when I run it in the IDE: I can read the data and process it. But when I build the project with Maven and run it on the command line, I get the following error:

    Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: csv. Please find packages at http://spark.apache.org/third-party-projects.html
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:569)
    at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:86)
    at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:86)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:325)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
    at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:415)
    at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:352)
    at com.cs6240.Driver.main(Driver.java:28)
Caused by: java.lang.ClassNotFoundException: csv.DefaultSource
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$25$$anonfun$apply$13.apply(DataSource.scala:554)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$25$$anonfun$apply$13.apply(DataSource.scala:554)
    at scala.util.Try$.apply(Try.scala:192)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$25.apply(DataSource.scala:554)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$25.apply(DataSource.scala:554)
    at scala.util.Try.orElse(Try.scala:84)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:554)
    ... 7 more

I'm not sure what I'm missing here. Is there some dependency for reading CSV files? According to the documentation, CSV support is built into Spark 2.x.x.

Upvotes: 1

Views: 734

Answers (1)

Arun

Reputation: 65

I fixed the issue by following the steps in this answer: https://stackoverflow.com/a/39465892/2705924

Basically, the issue was with the Maven assembly plugin: when building the fat jar, it clobbers the `META-INF/services` files that Spark uses to register its data sources, which is why the `csv` source can't be found at runtime. Switching to the Maven shade plugin and adding the `ServicesResourceTransformer`, which merges those service files instead of overwriting them, fixed it:

    <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
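For reference, a minimal sketch of where that transformer sits in the shade plugin configuration in `pom.xml` (the plugin version and surrounding build setup are placeholders; adjust them for your project):

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <!-- placeholder version; use whatever is current for your build -->
      <version>3.2.4</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <transformers>
              <!-- merges META-INF/services entries so Spark's
                   DataSourceRegister files survive in the fat jar -->
              <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
            </transformers>
          </configuration>
        </execution>
      </executions>
    </plugin>

With this in place, `mvn package` produces a shaded jar whose merged service files let Spark's `DataSource.lookupDataSource` resolve the built-in `csv` provider.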

Upvotes: 1
