Reputation: 156
I have an RDD holding an edge list, where each edge is a comma-separated pair (source_URL, destination_URL). I need to extract the source host from source_URL. I tried the following code:
val edges = links.flatMap { case (src, dst) =>
  if (!src.startsWith("http://") || !src.startsWith("https://")) {
    val src_url = "http://" + src
    val url = new java.net.URL(src_url)
    val uri = url.getHost
    scala.util.Try {
      Some(uri, dst)
    }.getOrElse(None)
  } else {
    val src_url = src
    val url = new java.net.URL(src_url)
    val uri = url.getHost
    scala.util.Try {
      Some(uri, dst)
    }.getOrElse(None)
  }
}
Input sample:
http://www.belvini.de/weingut/mID/2530/max-markert.html,http://www.belvini.de/content.php/coID/299/kundenmeinungen.html
http://www.belvini.de/weingut/mID/2530/max-markert.html,http://www.belvini.de/weingueter
http://www.belvini.de/weingut/mID/2530/max-markert.html,http://www.belvini.de/filter/cID/10/country/suedafrika.137.html
Required output:
www.belvini.de,http://www.belvini.de/content.php/coID/299/kundenmeinungen.html
www.belvini.de,http://www.belvini.de/weingueter
www.belvini.de,http://www.belvini.de/filter/cID/10/country/suedafrika.137.html
While running the code, I am getting an exception:
Job aborted due to stage failure: Task 935 in stage 3.0 failed 4 times, most recent failure: Lost task 935.3 in stage 3.0 (TID 1883, node27.ib, executor 248):
java.net.MalformedURLException: For input string: "RC-a-shops.de"
at java.net.URL.<init>(URL.java:627)
at java.net.URL.<init>(URL.java:490)
at java.net.URL.<init>(URL.java:439)
The RDD has around 1 million edges, and I'm running the job on a cluster. Can someone please suggest how to get rid of this exception?
Upvotes: 0
Views: 2065
Reputation: 33
The java.net.MalformedURLException: no protocol exception is also thrown when the string itself contains quotes:
new java.net.URL("\"http:www.example.com\"")
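As a minimal sketch of that case, the snippet below strips double quotes wrapped around a raw value before handing it to java.net.URL. The object name QuotedUrl and the helper unquote are hypothetical names chosen for illustration, and the sample value is made up.

```scala
import java.net.URL
import scala.util.Try

object QuotedUrl {
  // Hypothetical helper: trims whitespace, then removes one leading and one
  // trailing double quote if present.
  def unquote(raw: String): String =
    raw.trim.stripPrefix("\"").stripSuffix("\"")

  // Parses the value after unquoting; Failure carries the MalformedURLException.
  def parse(raw: String): Try[URL] = Try(new URL(unquote(raw)))
}
```

With this in place, Try(new URL("\"http://www.example.com\"")) fails because the leading quote keeps the scheme from being recognized, while QuotedUrl.parse on the same value succeeds.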
Upvotes: 0
Reputation: 3965
EDIT: The question was edited to include what looks like a well-formed URL in the MalformedURLException. Regardless, my answer stands. The docs for URL suggest it will only throw MalformedURLException when the URL is invalid in some way. More complete output would help in debugging this issue.
MalformedURLException - if no protocol is specified, or an unknown protocol is found, or spec is null.
It looks like your src doesn't include the protocol of the URL. You need something like http://whatever.com/nlp-agm.php, not just nlp-agm.php.
A URL must be of the form
<scheme>://<authority><path>?<query>#<fragment>
where <scheme> is required. new java.net.URL will throw MalformedURLException if the scheme is invalid or not specified. See more here: https://docs.oracle.com/javase/7/docs/api/java/net/URL.html#URL(java.lang.String)
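One way to keep a single bad record from aborting the job is to run the URL constructor itself inside Try. In the question's code the constructor runs before the Try block, so the exception is never caught. The scheme check there also joins two negated tests with ||, which is true for every string, so "http://" gets prepended even to sources that already have a scheme. The sketch below is one possible shape for a fix, not a definitive one. The names HostExtraction, withScheme, hostOf and toHostEdge are hypothetical, and it assumes that dropping unparseable edges is acceptable.

```scala
import java.net.URL
import scala.util.Try

object HostExtraction {
  // Prepend a scheme only when neither http:// nor https:// is already present.
  def withScheme(src: String): String =
    if (src.startsWith("http://") || src.startsWith("https://")) src
    else "http://" + src

  // The URL constructor runs inside Try, so a malformed source yields None
  // instead of throwing. Null or empty hosts are also mapped to None.
  def hostOf(src: String): Option[String] =
    Try(Option(new URL(withScheme(src.trim)).getHost))
      .toOption
      .flatten
      .filter(_.nonEmpty)

  // Maps one (source URL, destination URL) edge to (source host, destination URL),
  // dropping the edge when the source cannot be parsed.
  def toHostEdge(edge: (String, String)): Option[(String, String)] =
    hostOf(edge._1).map(host => (host, edge._2))
}
```

Assuming links is an RDD[(String, String)] as in the question, this could be applied with something like links.flatMap(edge => HostExtraction.toHostEdge(edge)). If the rejected records matter, collecting the failing sources separately would be a reasonable alternative to discarding them.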
Upvotes: 2