jclova

Reputation: 5576

Android: How to download RSS when a website contains: link rel="alternate" type="application/rss+xml"

I am making an RSS-related app.
I want to be able to download the RSS (XML) given only the URL of a website whose source contains:

link rel="alternate" type="application/rss+xml"

For example, the source of http://www.engadget.com contains:

<link rel="alternate" type="application/rss+xml" title="Engadget" href="http://www.engadget.com/rss.xml">

I am assuming that if I open this site in an RSS application,
it will redirect me to the http://www.engadget.com/rss.xml page.

My code to download the XML is as follows:

private boolean downloadXml(String url, String filename) {
        try {
            URL   urlxml = new URL(url);
            URLConnection ucon = urlxml.openConnection();
            ucon.setConnectTimeout(4000);
            ucon.setReadTimeout(4000);
            InputStream is = ucon.getInputStream();
            BufferedInputStream bis = new BufferedInputStream(is, 128);
            FileOutputStream fOut = openFileOutput(filename + ".xml", Context.MODE_WORLD_READABLE | Context.MODE_WORLD_WRITEABLE);
            // Copy raw bytes; a Writer would re-encode them and corrupt non-ASCII characters
            BufferedOutputStream bos = new BufferedOutputStream(fOut);
            int current = 0;
            while ((current = bis.read()) != -1) {
                bos.write(current);
            }
            bos.flush();
            bos.close();
            bis.close();

        } catch (Exception e) {
            return false;
        }
        return true;
    }

Without knowing the 'http://www.engadget.com/rss.xml' URL, how can I download the RSS when I input 'http://www.engadget.com'?

Upvotes: 4

Views: 8384

Answers (2)

Ansari

Reputation: 8218

I guess the obvious answer is that you first fetch the URL you have (http://www.engadget.com), then look through the HTML to find a <link> tag that has the right type, and then grab its href attribute. Here is some (Java) code that does that:

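URL url = new URL("http://www.engadget.com");
// Read the whole page into a string (assumes UTF-8; adjust if the site declares another charset)
StringBuilder buffer = new StringBuilder();
Reader reader = new InputStreamReader(url.openStream(), "UTF-8");
try {
  int ptr;
  while ((ptr = reader.read()) != -1) {
    buffer.append((char) ptr);
  }
} finally {
  reader.close();
}
String html = buffer.toString();
// Find the first <link> tag that mentions "rss", then pull out its href
Pattern rsspatt = Pattern.compile("<link[^>]*rss[^>]*>", Pattern.CASE_INSENSITIVE);
Matcher m = rsspatt.matcher(html);
String link = "";
if (m.find()) {
  String rsslink = m.group();
  Pattern xmllinkpatt = Pattern.compile("href=\"([^\"]+)\"");
  Matcher m2 = xmllinkpatt.matcher(rsslink);
  // Only read the group if the tag actually has a double-quoted href
  if (m2.find()) {
    link = m2.group(1);
  }
}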

At the end of this, the variable link will either be blank or contain the link you want, which you can feed into your downloadXml function.
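
One case the snippet above does not cover is a relative href (for example /rss.xml rather than a full URL). A minimal sketch of resolving it against the page URL with java.net.URI follows; the class and method names here are made up for illustration and are not part of the original code:

```java
import java.net.URI;

class FeedLinkResolver {
    // Hypothetical helper: resolves the href from a <link> tag against the URL
    // of the page it was found on. Absolute hrefs come back unchanged.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href.trim()).toString();
    }

    public static void main(String[] args) {
        // A root-relative href is resolved against the page's host
        System.out.println(resolve("http://www.engadget.com/", "/rss.xml"));
        // An absolute href is returned as-is
        System.out.println(resolve("http://www.engadget.com/", "http://www.engadget.com/rss.xml"));
    }
}
```

The resolved string is what you would then pass to downloadXml.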

Ordinarily I wouldn't recommend parsing HTML with regexes, but I assume this is for a phone app and you want to keep it simple and stick to the core libraries as much as possible. If you want to get fancy, you can use Jsoup to check for a link tag with the right attribute and extract the link you want.

Upvotes: 1

creemama

Reputation: 6665

To accomplish this, you need to:

  1. Detect whether the URL points to an HTML file. See the isHtml method in the code below.
  2. If the URL points to an HTML file, extract an RSS URL from it. See the extractRssUrl method in the code below.

The following code is a modified version of the code you pasted in your question. For I/O, I used Apache Commons IO for the useful IOUtils and FileUtils classes. IOUtils.toString is used to convert an input stream to a string, as recommended in the article "In Java, how do I read/convert an InputStream to a String?"

extractRssUrl uses regular expressions to parse HTML, even though doing so is widely frowned upon. (See the rant in "RegEx match open tags except XHTML self-contained tags.") With this in mind, treat extractRssUrl as a starting point: its regular expression is rudimentary and doesn't cover all cases.

Note that a call to isRss(str) is commented out. If you want to do RSS detection, see "How to detect if a page is an RSS or ATOM feed."
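
If you do want a feed check, one possible shape for it is sketched here. This is only an assumption about what an isRss-style method could look like, not the approach from the linked question: it parses the content as XML with the standard DOM parser and looks at the root element name (rss for RSS 2.0, feed for Atom, RDF for RSS 1.0). The class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class FeedSniffer {
    // Hypothetical helper: returns true when the content is well-formed XML
    // whose root element is <rss>, <feed> (Atom) or <rdf:RDF> (RSS 1.0).
    static boolean looksLikeFeed(String content) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // so getLocalName() drops the "rdf:" prefix
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(content)));
            String root = doc.getDocumentElement().getLocalName();
            return "rss".equals(root) || "feed".equals(root) || "RDF".equals(root);
        } catch (Exception e) {
            // Not well-formed XML (typical for real-world HTML), so not a feed
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(looksLikeFeed("<rss version=\"2.0\"><channel/></rss>"));
        System.out.println(looksLikeFeed("<html><body></body></html>"));
    }
}
```

A full XML parse is heavier than a regex check, but it avoids false positives from pages that merely mention "rss" somewhere in their text.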

private boolean downloadXml(String url, String filename) {
    InputStream is = null;
    try {
        URL urlxml = new URL(url);
        URLConnection ucon = urlxml.openConnection();
        ucon.setConnectTimeout(4000);
        ucon.setReadTimeout(4000);
        is = ucon.getInputStream();
        String str = IOUtils.toString(is, "UTF-8");
        if (isHtml(str)) {
            String rssURL = extractRssUrl(str);
            if (rssURL != null && !url.equals(rssURL)) {
                // Pass the bare name through; the extension is added once, when writing
                return downloadXml(rssURL, filename);
            }
        } else { // if (isRss(str)) {
            // For now, we'll assume that we're an RSS feed at this point
            FileUtils.write(new File(filename + ".xml"), str, "UTF-8");
            return true;
        }
    } catch (Exception e) {
        // do nothing
    } finally {
        IOUtils.closeQuietly(is);
    }
    return false;
}

private boolean isHtml(String str) {
    Pattern pattern = Pattern.compile("<html", Pattern.CASE_INSENSITIVE | Pattern.DOTALL | Pattern.MULTILINE);
    Matcher matcher = pattern.matcher(str);
    return matcher.find();
}

private String extractRssUrl(String str) {
    // Matches RSS (application/rss+xml) and Atom (application/atom+xml) link tags
    Pattern pattern = Pattern.compile("<link(?:\\s+href=\"([^\"]*)\"|\\s+[a-z\\-]+=\"[^\"]*\")*\\s+type=\"application/(?:rss|atom)\\+xml\"(?:\\s+href=\"([^\"]*)\"|\\s+[a-z\\-]+=\"[^\"]*\")*?\\s*/?>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL | Pattern.MULTILINE);
    Matcher matcher = pattern.matcher(str);
    if (matcher.find()) {
        for (int i = 1; i <= matcher.groupCount(); i++) {
            if (matcher.group(i) != null) {
                return matcher.group(i);
            }
        }
    }
    return null;
}

The above code works with your Engadget example:

obj.downloadXml("http://www.engadget.com/", "rss");
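
Since the regular expression in extractRssUrl only accepts double-quoted attributes, a slightly more tolerant variant is sketched below. It is still regex-based and still not a real HTML parser; it just collects each link tag's attributes into a map so that attribute order and single quotes no longer matter. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FeedLinkFinder {
    private static final Pattern LINK_TAG =
            Pattern.compile("<link\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // name="value" or name='value'
    private static final Pattern ATTRIBUTE =
            Pattern.compile("([a-zA-Z\\-]+)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')");

    // Returns the href of the first <link> whose type is RSS or Atom, or null.
    static String findFeedHref(String html) {
        Matcher tag = LINK_TAG.matcher(html);
        while (tag.find()) {
            Map<String, String> attrs = new HashMap<>();
            Matcher attr = ATTRIBUTE.matcher(tag.group());
            while (attr.find()) {
                String value = attr.group(2) != null ? attr.group(2) : attr.group(3);
                attrs.put(attr.group(1).toLowerCase(Locale.ROOT), value);
            }
            String type = attrs.get("type");
            if (type == null) {
                continue;
            }
            type = type.trim().toLowerCase(Locale.ROOT);
            boolean isFeedType = type.equals("application/rss+xml")
                    || type.equals("application/atom+xml");
            if (isFeedType && attrs.get("href") != null) {
                return attrs.get("href");
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String html = "<link rel=\"alternate\" type=\"application/rss+xml\" "
                + "title=\"Engadget\" href=\"http://www.engadget.com/rss.xml\">";
        System.out.println(findFeedHref(html));
    }
}
```

It still breaks on attribute values that contain a > character, so an HTML parser remains the safer choice if you can afford the dependency.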

Upvotes: 2
