Reputation: 5576
I am making an RSS-related app.
I want to be able to download the RSS feed (XML) given only the URL of a website whose source contains:
link rel="alternate" type="application/rss+xml"
For example, the source of http://www.engadget.com contains:
<link rel="alternate" type="application/rss+xml" title="Engadget" href="http://www.engadget.com/rss.xml">
I am assuming that if I open this site in an RSS application,
it will redirect me to the http://www.engadget.com/rss.xml page.
My code to download the XML is as follows:
private boolean downloadXml(String url, String filename) {
    try {
        URL urlxml = new URL(url);
        URLConnection ucon = urlxml.openConnection();
        ucon.setConnectTimeout(4000);
        ucon.setReadTimeout(4000);
        InputStream is = ucon.getInputStream();
        BufferedInputStream bis = new BufferedInputStream(is, 128);
        FileOutputStream fOut = openFileOutput(filename + ".xml", Context.MODE_WORLD_READABLE | Context.MODE_WORLD_WRITEABLE);
        BufferedOutputStream out = new BufferedOutputStream(fOut);
        int current;
        // Copy raw bytes; pushing them through a Writer would mangle non-ASCII content.
        while ((current = bis.read()) != -1) {
            out.write(current);
        }
        out.flush();
        out.close();
        bis.close();
    } catch (Exception e) {
        return false;
    }
    return true;
}
Without knowing the 'http://www.engadget.com/rss.xml' URL, how can I download the RSS feed when I input 'http://www.engadget.com'?
Upvotes: 4
Views: 8384
Reputation: 8218
I guess the obvious answer is that you first fetch the URL you have (http://www.engadget.com), then look through the HTML to find a <link> tag that has the right type, and then grab its href attribute. Here is some (Java) code that does that:
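import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

URL url = new URL("http://www.engadget.com");
StringBuilder buffer = new StringBuilder();
// Decode the page as UTF-8 (an assumption) and close the stream when done.
try (Reader in = new InputStreamReader(url.openStream(), StandardCharsets.UTF_8)) {
    int ch;
    while ((ch = in.read()) != -1) {
        buffer.append((char) ch);
    }
}
String html = buffer.toString();
// Find the first <link> tag that mentions "rss".
Pattern rsspatt = Pattern.compile("<link[^>]*rss[^>]*>");
Matcher m = rsspatt.matcher(html);
String link = "";
if (m.find()) {
    String rsslink = m.group();
    // Pull the href value out of that tag; leave link blank if there is none.
    Pattern xmllinkpatt = Pattern.compile("href=\"([^\"]+)\"");
    Matcher m2 = xmllinkpatt.matcher(rsslink);
    if (m2.find()) {
        link = m2.group(1);
    }
}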
At the end of this, the variable link will either be blank or contain the link you want, which you can feed into your downloadXml function.
Ordinarily I wouldn't recommend parsing HTML via regexes, but I assume this is for a phone app and you want to keep it simple and stick to the core libraries as much as possible. Of course, if you want to get fancy, you can use Jsoup to check that the link tag exists with the right attribute and extract the link you want.
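If you stay with core Java, a slightly more tolerant variant of the regex approach is to find each <link> tag first and then read its attributes one by one, so attribute order and quote style matter less. The following is only a sketch under assumptions: FeedLinkFinder and findFeedUrl are hypothetical names, only the two common feed types are accepted, and relative href values are resolved against the page URL.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class FeedLinkFinder {
    // Matches each <link ...> tag; its attributes are parsed separately below.
    private static final Pattern LINK_TAG =
            Pattern.compile("<link\\b([^>]*)>", Pattern.CASE_INSENSITIVE);
    // Matches name="value" or name='value' pairs inside a tag.
    private static final Pattern ATTRIBUTE =
            Pattern.compile("([a-zA-Z\\-]+)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')");

    /** Returns the absolute feed URL advertised in the HTML, or null if none is found. */
    static String findFeedUrl(String pageUrl, String html) {
        Matcher tag = LINK_TAG.matcher(html);
        while (tag.find()) {
            Map<String, String> attrs = new HashMap<>();
            Matcher attr = ATTRIBUTE.matcher(tag.group(1));
            while (attr.find()) {
                String value = attr.group(2) != null ? attr.group(2) : attr.group(3);
                attrs.put(attr.group(1).toLowerCase(Locale.ROOT), value);
            }
            String type = attrs.getOrDefault("type", "").toLowerCase(Locale.ROOT);
            String href = attrs.get("href");
            boolean isFeed = type.equals("application/rss+xml")
                    || type.equals("application/atom+xml");
            if (isFeed && href != null && !href.isEmpty()) {
                try {
                    // Resolve relative hrefs such as "/rss.xml" against the page URL.
                    return new URL(new URL(pageUrl), href).toString();
                } catch (MalformedURLException e) {
                    // Unusable href: keep scanning the remaining <link> tags.
                }
            }
        }
        return null;
    }
}
```

You would call FeedLinkFinder.findFeedUrl("http://www.engadget.com", html) with the HTML you already fetched and pass a non-null result to downloadXml. It is still regex-based, so unusual markup (for example unquoted attribute values) will be missed.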
Upvotes: 1
Reputation: 6665
To accomplish this, you need to:
1. Detect whether the URL points to an HTML page. See the isHtml method in the code below.
2. If it does, extract the RSS URL from the HTML. See the extractRssUrl method in the code below.
The following code is a modified version of the code you pasted in your question. For I/O, I used Apache Commons IO for the useful IOUtils and FileUtils classes. IOUtils.toString is used to convert an input stream to a string, as recommended in the article "In Java, how do I read/convert an InputStream to a String?"
extractRssUrl uses regular expressions to parse HTML, even though it is highly frowned upon. (See the rant in "RegEx match open tags except XHTML self-contained tags.") With this in mind, let extractRssUrl be a starting point. The regular expression in extractRssUrl is rudimentary and doesn't cover all cases.
Note that a call to isRss(str) is commented out. If you want to do RSS detection, see "How to detect if a page is an RSS or ATOM feed."
import java.io.File;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.IOUtils;

private boolean downloadXml(String url, String filename) {
    InputStream is = null;
    try {
        URL urlxml = new URL(url);
        URLConnection ucon = urlxml.openConnection();
        ucon.setConnectTimeout(4000);
        ucon.setReadTimeout(4000);
        is = ucon.getInputStream();
        String str = IOUtils.toString(is, "UTF-8");
        if (isHtml(str)) {
            // HTML page: look for an advertised feed and download that instead.
            String rssURL = extractRssUrl(str);
            if (rssURL != null && !url.equals(rssURL)) {
                return downloadXml(rssURL, filename);
            }
        } else { // if (isRss(str)) {
            // For now, we'll assume that we're an RSS feed at this point.
            // The ".xml" extension is added here so the file name is the same
            // whether or not we arrived through an HTML page.
            FileUtils.write(new File(filename + ".xml"), str, "UTF-8");
            return true;
        }
    } catch (Exception e) {
        // Any failure (bad URL, timeout, I/O error) falls through to "return false".
    } finally {
        IOUtils.closeQuietly(is);
    }
    return false;
}

private boolean isHtml(String str) {
    Pattern pattern = Pattern.compile("<html", Pattern.CASE_INSENSITIVE | Pattern.DOTALL | Pattern.MULTILINE);
    Matcher matcher = pattern.matcher(str);
    return matcher.find();
}

private String extractRssUrl(String str) {
    // Accepts RSS ("application/rss+xml") and Atom ("application/atom+xml") links,
    // with href either before (group 1) or after (group 2) the type attribute.
    Pattern pattern = Pattern.compile("<link(?:\\s+href=\"([^\"]*)\"|\\s+[a-z\\-]+=\"[^\"]*\")*\\s+type=\"application/(?:rss|atom)\\+xml\"(?:\\s+href=\"([^\"]*)\"|\\s+[a-z\\-]+=\"[^\"]*\")*?\\s*/?>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL | Pattern.MULTILINE);
    Matcher matcher = pattern.matcher(str);
    if (matcher.find()) {
        for (int i = 1; i <= matcher.groupCount(); i++) {
            if (matcher.group(i) != null) {
                return matcher.group(i);
            }
        }
    }
    return null;
}
The above code works with your Engadget example:
obj.downloadXml("http://www.engadget.com/", "rss");
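If you do want to enable the commented-out isRss check, one rough way is to look at the document's root element. This is a minimal sketch, not a full feed validator: FeedSniffer is a hypothetical class name, and it assumes that a feed starts (after an optional XML declaration, comments, or DOCTYPE) with an <rss>, <feed>, or <rdf:RDF> element.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class FeedSniffer {
    // Skips an optional BOM, XML declaration / processing instructions, comments,
    // and a DOCTYPE, then expects the root element of an RSS 2.0 (<rss>),
    // Atom (<feed>), or RSS 1.0 (<rdf:RDF>) document.
    private static final Pattern FEED_ROOT = Pattern.compile(
            "^\\uFEFF?\\s*(?:<\\?.*?\\?>\\s*|<!--.*?-->\\s*|<!DOCTYPE[^>]*>\\s*)*"
                    + "<(?:rss|feed|rdf:RDF)[\\s>]",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns true if the text looks like an RSS or Atom document. */
    static boolean isRss(String str) {
        return str != null && FEED_ROOT.matcher(str).find();
    }
}
```

With this in place, the else branch in downloadXml could become else if (FeedSniffer.isRss(str)), so that content that is neither HTML nor a feed is not saved.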
Upvotes: 2