lingamaneni

Reputation: 65

Processing XML string inside Spark UDF and return Struct Field

I have a DataFrame column named Body (String). The data in the Body column looks like this:

<p>I want to use a track-bar to change a form's opacity.</p>

<p>This is my code:</p>

 <pre><code>decimal trans = trackBar1.Value / 5000;
this.Opacity = trans;
</code></pre>

<p>When I build the application, it gives the following error:</p>

<blockquote>
  <p>Cannot implicitly convert type 'decimal' to 'double'.</p>
</blockquote>

<p>I tried using <code>trans</code> and <code>double</code> but then the 
control doesn't work. This code worked fine in a past VB.NET project. </p>
,While applying opacity to a form should we use a decimal or double value?

Using Body, I want to produce two separate columns, code and text. code is the content of the elements named code, and text is everything else.

I have created a UDF which looks like this:

case class bodyresults(text: String, code: String)

val Body: String => bodyresults = (body: String) => {
    val xmlbody = scala.xml.XML.loadString(body)
    val code = (xmlbody \\ "code").toString
    val text = "I want every thing else as text. what should I do"
    (text, code)
}

val bodyudf = udf(Body)
val posts5 = posts4.withColumn("codetext", bodyudf(col("Body")))

This is not working. My questions are:

1. As you can see, there is no root node in the data. Can I still use Scala XML parsing?
2. How do I parse everything except the code into text?

If there is something wrong in my code, please let me know.

Expected output:

 (code,text)
 code = decimal trans = trackBar1.Value / 5000;this.Opacity = trans;trans double  
 text = everything else  

Upvotes: 1

Views: 1605

Answers (1)

philantrovert

Reputation: 10092

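The UDF below wraps the body in a root element and strips the code out of the text with a string replace. Instead of doing a replace, you could also use a RewriteRule and override its transform method to empty the <pre> tag in your XML.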

import org.apache.spark.sql.functions.udf
import scala.xml.XML

case class bodyresults(text: String, code: String)

val bodyudf = udf { (body: String) =>

    // Wrap the fragment in an explicit <body> root so it parses as one XML document
    val xmlElems = XML.loadString(s""" <body> ${body} </body> """)
    // Extract the code inside the <pre><code> elements
    val code = (xmlElems \\ "body" \\ "pre" \\ "code").text

    // replace (not replaceAll) treats the code as a literal string rather than a regex
    val text = (xmlElems \\ "body").text.replace(code, "")

    bodyresults(text, code)
}

This UDF returns a struct column. Its definition shows the StructType:

org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(<function1>,StructType(StructField(text,StringType,true), StructField(code,StringType,true)),List(StringType))

You can now apply it to your DataFrame like this:

val posts5 = df.withColumn("codetext", bodyudf($"xml") )
posts5: org.apache.spark.sql.DataFrame = [xml: string, codetext: struct<text:string,code:string>]

To extract a specific column :

posts5.select($"codetext.code" ).show
+--------------------+
|                code|
+--------------------+
|decimal trans = t...|
+--------------------+
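
For the RewriteRule route mentioned at the start of this answer, here is a minimal sketch of how it might look outside Spark. The names BodySplitter, DropPre and splitBody are made up for illustration, and it assumes scala-xml is on the classpath and that the body is well-formed XML once it is wrapped in a root element.

```scala
import scala.xml.{Elem, Node, NodeSeq, XML}
import scala.xml.transform.{RewriteRule, RuleTransformer}

object BodySplitter {

  // Hypothetical rule: drop every <pre> element, and with it the <code> inside.
  object DropPre extends RewriteRule {
    override def transform(n: Node): Seq[Node] = n match {
      case e: Elem if e.label == "pre" => NodeSeq.Empty
      case other                       => other
    }
  }

  // Returns (text, code) for an XML fragment that has no root node.
  def splitBody(body: String): (String, String) = {
    // Wrap the fragment so it parses as a single document.
    val root = XML.loadString(s"<body>$body</body>")
    // Code is whatever sits in <pre><code>...</code></pre>.
    val code = (root \\ "pre" \\ "code").text
    // Text is what remains after the <pre> elements are removed.
    val text = new RuleTransformer(DropPre).apply(root).text
    (text, code)
  }
}
```

Inside the UDF, the body of splitBody would take the place of the replace call. Note that inline <code> elements outside a <pre> stay in the text with this rule, which may or may not be what you want.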

Upvotes: 1
