Reputation: 65
I'm trying to read a compressed CSV file (.bz2) as a DataFrame. My code is as follows:
// read the data
Dataset<Row> rData = spark.read().option("header", true).csv(input);
This works when I run it in the IDE: I can read the data and process it. But when I build the jar with Maven and run it on the command line, I get the following error:
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: csv. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:569)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:86)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:86)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:325)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:415)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:352)
at com.cs6240.Driver.main(Driver.java:28)
Caused by: java.lang.ClassNotFoundException: csv.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$25$$anonfun$apply$13.apply(DataSource.scala:554)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$25$$anonfun$apply$13.apply(DataSource.scala:554)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$25.apply(DataSource.scala:554)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$25.apply(DataSource.scala:554)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:554)
... 7 more
I'm not sure if I'm missing something here. Is there some dependency needed for reading CSV files? According to the documentation, CSV support is built in since Spark 2.x.
Upvotes: 1
Views: 734
Reputation: 65
I fixed the issue by following the steps in this answer: https://stackoverflow.com/a/39465892/2705924
Basically, the problem was with the assembly plugin: when it builds the fat jar it does not merge the META-INF/services files, so the service registration that lets Spark look up the built-in csv data source is lost. Switching to the shade plugin and adding this transformer fixed it:
<transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
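For reference, here is a minimal sketch of how that transformer fits into a maven-shade-plugin configuration in pom.xml. The plugin version and execution phase shown are just example values; adjust them to your build.

<!-- inside <build><plugins> of pom.xml -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <!-- example version; use whatever matches your build -->
  <version>3.2.4</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <!-- merges META-INF/services entries (e.g. org.apache.spark.sql.sources.DataSourceRegister)
               instead of letting one jar's copy overwrite the others -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>

After rebuilding with mvn package, the shaded jar's META-INF/services/org.apache.spark.sql.sources.DataSourceRegister should contain the CSV provider entry, and spark-submit should resolve the csv source again.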
Upvotes: 1