Xingsheng Guo

Reputation: 71

Get all text content between tags from a URL?

I have a URL, for example: http://www.engineersireland.ie/home.aspx

I can read the page using Java's built-in java.net.URL or Jsoup.

Then, I need to extract all text content between tags after tag.

Tags may be nested within other tags. All I need is the text inside them.

For example:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
     <head id="head"><title>
        Engineers Ireland - Home
     </title><meta http-equiv="content-type" content="text/html; charset=UTF-8" /> 
    <meta http-equiv="pragma" content="no-cache" /> 
    <meta http-equiv="content-style-type" content="text/css" /> 
    <meta http-equiv="content-script-type" content="text/javascript" /> 

    <link href="/favicon.ico" type="image/x-icon" rel="shortcut icon"/> 
    <link href="/favicon.ico" type="image/x-icon" rel="icon"/>
    <body>
    <div class="module-content">

        <p id="1">Members can login for access to exclusive content, event booking, shop discounts and more...</p>

            <fieldset>
                <legend>Your Login Details</legend>
                <div class="formline">
                    <label for="1" id="1">Your Membership Number</label>
                    <input name="1" type="text" id="1" title="Your Membership Number" class="login-username clearlabel" />
                    <span id="1e" class="ErrorLabel" style="display:none;">Enter your membership number</span>
                </div>
                <div class="formline">
                    <label for="1" id="adasdasd">Password</label>
                    <input name="asdasd" type="password" id="dfbsdf" title="Password" class="login-password clearlabel" />
                    <span id="drthd" class="ErrorLabel" style="display:none;">Enter your password</span>
                </div>
                <div class="formline">
                    <input name="aseresrr" type="checkbox" id="bstg" class="login-remember" />
                    <label for="ryjmf" id="asrats" class="remember">Remember Me</label>

                    <div class="button grey">
                        <input type="submit" name="fgn" value="LOGIN" onclick="sdf;, false, false))" id="sdfsdf" />
                    </div>
                </div>

            </fieldset>
        <ul class="arrow">
            <li><a href="/site/reset-password.aspx">Forgot your password?</a></li>
            <li><a href="/membership/apply.aspx">Haven't registered yet?</a></li>
        </ul>
    </div>
    </body>
    </html>

From this HTML, all I need is:

Your Membership Number
Enter your membership number
Password
Enter your password
Remember Me

One other thing: keep in mind that the tag names and the number of tags vary depending on the web page itself.

Any help, using either Jsoup or plain Java? Thanks.

Upvotes: 1

Views: 2343

Answers (2)

nivekastoreth

Reputation: 1427

With the following, you can specify which section of the document to extract text from by passing the appropriate CSS query to the getStringsFromUrl method. To search the whole body, pass in null.

    import org.jsoup.Jsoup;
    import org.jsoup.helper.StringUtil;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Node;
    import org.jsoup.nodes.TextNode;
    import org.jsoup.select.Elements;
    import org.jsoup.select.NodeVisitor;

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class JSoupTest {
        /*
         Outputs:
            Members can login for access to exclusive content, event booking, shop discounts and more...
            Your Login Details
            Your Membership Number
            Enter your membership number
            Password
            Enter your password
            Remember Me
            Forgot your password?
            Haven't registered yet?
         */
        public static void main(String[] args) throws IOException {
            String url = "http://localhost/test.html";
            List<String> strings = getStringsFromUrl(url, null);
            for (String string : strings) {
                System.out.println(string);
            }
        }

        // Fetches the page and collects every non-blank text node under the
        // elements matched by cssQuery (or under <body> if the query is blank).
        private static List<String> getStringsFromUrl(String url, String cssQuery) throws IOException {
            Document document = Jsoup.connect(url).get();
            Elements elements = StringUtil.isBlank(cssQuery)
                    ? document.getElementsByTag("body")
                    : document.select(cssQuery);

            List<String> strings = new ArrayList<String>();
            elements.traverse(new TextNodeExtractor(strings));
            return strings;
        }

        // Visitor that records the text of each TextNode it encounters.
        private static class TextNodeExtractor implements NodeVisitor {
            private final List<String> strings;

            public TextNodeExtractor(List<String> strings) {
                this.strings = strings;
            }

            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    TextNode textNode = ((TextNode) node);
                    String text = textNode.getWholeText();
                    // Skip whitespace-only nodes (indentation, line breaks).
                    if (!StringUtil.isBlank(text)) {
                        strings.add(text);
                    }
                }
            }

            @Override
            public void tail(Node node, int depth) {}
        }
    }
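
Since the question also asks about plain Java, here is a rough sketch of a similar idea using only the HTML parser that ships with the JDK (javax.swing.text.html). This is a different technique from the Jsoup NodeVisitor above, not a drop-in equivalent. The class name PlainTextExtractor and the method extract are made up for illustration, and the sketch parses an HTML string, so fetching the page from the URL is left out. The Swing parser was written for older HTML, so results on modern pages may differ from what Jsoup gives.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper (not from the answer above): collects the text chunks
// reported by the JDK's built-in HTML parser, without Jsoup.
class PlainTextExtractor {

    static List<String> extract(String html) {
        final List<String> strings = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Collapse whitespace runs and drop chunks that end up empty.
                String text = new String(data).replaceAll("\\s+", " ").trim();
                if (!text.isEmpty()) {
                    strings.add(text);
                }
            }
        };
        try {
            // Third argument true: ignore any charset directive in the markup.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return strings;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Members can login</p>"
                + "<ul><li><a href=\"/a\">Forgot your password?</a></li></ul>"
                + "</body></html>";
        for (String s : extract(html)) {
            System.out.println(s);
        }
    }
}
```

Like the Jsoup version, this collects every text chunk rather than only the label and span text, so any filtering by tag would still have to be added on top.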

Upvotes: 2

Vishvesh Phadnis

Reputation: 2578

Use the HtmlUnit library in Java; it lets you find the content of the tags you choose.

See the getting-started guide:

http://htmlunit.sourceforge.net/gettingStarted.html

Upvotes: 0
