Jsoup begin parsing AFTER specified tag or start from bottom of page?

Question

I have a block of HTML that I am parsing with Jsoup. However, not all of it is relevant, and parsing the irrelevant parts throws off my data set.

On the site, there is a header that can change at any time. This header contains links that I don't care about. When Jsoup parses the document, it adds those links to my link array and throws off my values.

The HTML I am interested in comes after the tag.

I would like to be able to tell Jsoup to ignore everything above that tag. Is this possible? If not, I can work around this issue by beginning my parsing at the bottom of the document, but I'm not sure how I would go about that either.

My Jsoup query is as follows. Please ignore the commented-out lines and debugging statements; I've been trying to work this out for a while and still have the test code in.

       Thread getTitlesThread = new Thread() {
            public void run() {
                TitleResults titleArray =  new TitleResults();
                StringBuilder whole = new StringBuilder();

                try {
                    URL url = new URL(
                            Constants.FORUM);
                    HttpURLConnection urlConnection = (HttpURLConnection) url.openConnection();
                    urlConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2");
                    try {
                        BufferedReader in = new BufferedReader(
                            new InputStreamReader(new BufferedInputStream(urlConnection.getInputStream())));
                        String inputLine;
                        while ((inputLine = in.readLine()) != null)
                            whole.append(inputLine);
                        in.close();
                    } catch (IOException e) {}
                    finally {
                        urlConnection.disconnect();
                    }
                } catch (Exception e) {}
                Document doc = Parser.parse(whole.toString(), Constants.FORUM);
                Elements threads = doc.select("TOPICS > .topic_title");
                Elements authors = doc.select("a[hovercard-ref]");
//              for (Element author : authors) {
//                  authorArray.add(author.text());
//              }
//              cleanAuthors();
                if (threads.isEmpty()) {
                    Log.d("POC", "EMPTY BRO!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!11");
                }
//              for (Element thread : threads) {
//                  titleArray =  new TitleResults();
//                  Log.d("POC", thread.toString());
//
//                  titleArray.setAuthorDate(authorArray.get(0));
//                  authorArray.remove(0);

                    //Thread title
//                  threadTitle = thread.text();
//                  titleArray.setItemName(threadTitle);
//                  
//                  //Thread link
//                  String threadStr = thread.attr("abs:href");
//                  String endTag = "/page__view__getnewpost"; //trim link
//                  threadStr = new String(threadStr.replace(endTag, ""));
//                  threadArray.add(threadStr);
//                  results.add(titleArray);
//              }
           } 
        };
        getTitlesThread.start();

r2DoesInc · Accepted Answer

Remove the part of the document that you don't want to parse with:

Document doc = Parser.parse(whole.toString().replaceAll("?.*?", ""), Constants.FORUM);

Where was the beginning of what I wanted to ignore and was the end.
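
A minimal sketch of the same idea, written with only the standard library so it runs without Jsoup. The marker strings `<!--HEAD-->` and `<!--/HEAD-->`, the class name `HeaderStripper`, and both method names are hypothetical placeholders for illustration, not the forum's actual markup. The markers are passed through `Pattern.quote` so they are matched literally, and `(?s)` lets `.` match newlines. The question's code joins lines without newlines, so the flag only matters if the line breaks are preserved.

```java
import java.util.regex.Pattern;

public class HeaderStripper {

    // Remove everything from startMarker through endMarker (inclusive).
    // Both markers are hypothetical; substitute whatever delimits the header.
    static String stripBetween(String html, String startMarker, String endMarker) {
        String regex = "(?s)" + Pattern.quote(startMarker) + ".*?" + Pattern.quote(endMarker);
        return html.replaceAll(regex, "");
    }

    // Alternative: keep only the text that follows the first occurrence of marker.
    // If the marker is not found, the input is returned unchanged.
    static String keepAfter(String html, String marker) {
        int idx = html.indexOf(marker);
        return idx < 0 ? html : html.substring(idx + marker.length());
    }

    public static void main(String[] args) {
        // Made-up sample input standing in for the downloaded page.
        String html = "<!--HEAD--><a href=\"/nav\">Nav</a>\n<!--/HEAD-->"
                + "<a class=\"topic_title\" href=\"/t/1\">First</a>";
        System.out.println(stripBetween(html, "<!--HEAD-->", "<!--/HEAD-->"));
        System.out.println(keepAfter(html, "<!--/HEAD-->"));
    }
}
```

The cleaned string would then be handed to `Parser.parse(cleaned, Constants.FORUM)` as in the question, so the header links never reach the resulting `Document`.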
