Reputation: 21
Other people use this code:
spark.read \
.format('com.databricks.spark.xml') \
.option('rootTag', 'tags') \
.option('rowTag', 'row') \
.load('example.xml')
I don't want to use Databricks, so I tried this instead:
df = spark.read.format('xml').options(rowTag='file').load('ted_en-20160408.xml')
but there is an error:
Py4JJavaError: An error occurred while calling o222.load.
: java.lang.ClassNotFoundException: Failed to find data source: xml. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:692)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:746)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:265)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: xml.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:666)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:666)
at scala.util.Failure.orElse(Try.scala:224)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:666)
... 14 more
I want to read and parse the XML data. My final goal is TF-IDF and SVD.
Java : 1.8.0
spark : 3.1.2
Scala : 2.12.10
Python : 3.8.5
Upvotes: 0
Views: 895
Reputation: 4189
You can try reading the content of the XML file as a string into a Spark DataFrame, and then use the Spark SQL xpath family of functions to process it.
Upvotes: 1
Reputation: 191681
But I don't want to use Databricks
Okay, then you need to implement your own Spark data source reader for XML, since that's not a built-in option.
Otherwise, write your parser elsewhere, then reformat your data into something Spark can work with out of the box. For example, read the complete file as a string, then use the Python lxml or etree modules to build a DataFrame with some schema.
Upvotes: 1