Arun

Reputation: 65

Spark 2.1.0 : Reading compressed csv file

I'm trying to read a compressed CSV file (.bz2) as a DataFrame. My code is as follows:

    // read the CSV data (Spark decompresses .bz2 input transparently)
    Dataset<Row> rData = spark.read().option("header", true).csv(input);

This works when I run it in the IDE: I can read the data and process it. But when I build the project with Maven and run it on the command line, I get the following error:

    Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: csv. Please find packages at http://spark.apache.org/third-party-projects.html
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:569)
    at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:86)
    at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:86)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:325)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
    at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:415)
    at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:352)
    at com.cs6240.Driver.main(Driver.java:28)
Caused by: java.lang.ClassNotFoundException: csv.DefaultSource
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$25$$anonfun$apply$13.apply(DataSource.scala:554)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$25$$anonfun$apply$13.apply(DataSource.scala:554)
    at scala.util.Try$.apply(Try.scala:192)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$25.apply(DataSource.scala:554)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$25.apply(DataSource.scala:554)
    at scala.util.Try.orElse(Try.scala:84)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:554)
    ... 7 more

I'm not sure what I'm missing here. Is there some dependency for reading CSV files? According to the documentation, CSV support is built into Spark 2.x.x.

Upvotes: 1

Views: 734

Answers (1)

Arun

Reputation: 65

I fixed the issue by following the steps in this answer: https://stackoverflow.com/a/39465892/2705924

Basically, the issue was with the Maven assembly plugin: when building the fat jar, it clobbers the `META-INF/services` files that Spark uses to register its data sources, which is why the `csv` source can't be found at runtime. Switching to the Maven shade plugin and adding the `ServicesResourceTransformer`, which merges those service files instead of overwriting them, fixed it:

    <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
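For reference, a minimal sketch of where that transformer sits in the shade plugin configuration in `pom.xml` (the plugin version and surrounding build setup are placeholders; adjust them for your project):

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <!-- placeholder version; use whatever is current for your build -->
      <version>3.2.4</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <transformers>
              <!-- merges META-INF/services entries so Spark's
                   DataSourceRegister files survive in the fat jar -->
              <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
            </transformers>
          </configuration>
        </execution>
      </executions>
    </plugin>

With this in place, `mvn package` produces a shaded jar whose merged service files let Spark's `DataSource.lookupDataSource` resolve the built-in `csv` provider.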

Upvotes: 1
