Meeeee

How to extract a substring from a string in Java

I am validating URLs from my code. I have a file with URLs in it and I want to see whether they exist. If a URL exists, the web page contains XML in which there will be an email address I want to extract. I go round a while loop and, on each iteration, if the URL exists, the XML is added to a string, so I end up with one big string containing the XML. What I want to do is extract the email address from this string. I can't use the methods in the String API as they require you to specify the starting index, which I don't know as it varies each time.

What I was hoping to do was search the string for a substring starting with (e.g. "<email id>") and ending with (e.g. "</email id>") and add the text between these markers to a separate string.

Does anyone know if this is possible to do or if there is an easier/different way of doing what I want to do?

Thanks.

Upvotes: 0

Views: 1439

Answers (6)

TygerKrash

Reputation: 1382

If I understand your question correctly, you are extracting pieces of XML from multiple web pages and concatenating them into one big 'xml' string, something that looks like this:


"<somedata>blah</somedata>
<email>[email protected]</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
<email>[email protected]</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
<email>[email protected]</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
"

I'd advise making that a well-formed XML document by wrapping it in a root element:

"<?xml version="1.0" encoding="ISO-8859-1"?>
<newRoot>
<somedata>blah</somedata>
<email>[email protected]</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
<email>[email protected]</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
<email>[email protected]</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
</newRoot>"

Then you could load that into an XML Document object and use XPath expressions to extract the email nodes and their values.

If you don't want to do that, you could use the indexOf(String str, int fromIndex) method to find the positions of <email> and </email> (or whatever they are called), and then take substrings based on those. That's not a particularly clean or easy-to-read way of doing it, though.
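A minimal sketch of the indexOf approach, in case it helps. The class name TagExtractor, the method name extractBetween and the example.com addresses are made up for illustration, and the tag names are assumed to be exactly <email> and </email> with no attributes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
final class TagExtractor {

    // Returns the text between every openTag/closeTag pair found in source.
    static List<String> extractBetween(String source, String openTag, String closeTag) {
        List<String> results = new ArrayList<>();
        int from = 0;
        while (true) {
            // Look for the next opening tag at or after the current position.
            int start = source.indexOf(openTag, from);
            if (start < 0) {
                break; // no more opening tags
            }
            int contentStart = start + openTag.length();
            int end = source.indexOf(closeTag, contentStart);
            if (end < 0) {
                break; // opening tag without a closing tag: stop
            }
            results.add(source.substring(contentStart, end));
            // Continue searching after the closing tag.
            from = end + closeTag.length();
        }
        return results;
    }

    public static void main(String[] args) {
        // Made-up sample data.
        String xml = "<somedata>blah</somedata><email>a@example.com</email>"
                   + "<somedata>blah</somedata><email>b@example.com</email>";
        System.out.println(extractBetween(xml, "<email>", "</email>"));
        // prints [a@example.com, b@example.com]
    }
}
```

This works without knowing the starting index in advance, because indexOf finds it for you on each pass through the loop.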

Upvotes: 0

ipingu

Reputation: 317

If you know the structure of the XML document well, I'd recommend using XPath.

For example, with emails contained in <email>[email protected]</email>, the XPath query would be something like /root/email (depending on your XML structure).

By executing this XPath query on your XML, you will get all the <email> elements (Nodes) returned as a node list. Once you have the XML element, you have its content (getTextContent on the element, or getNodeValue on its text child).
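Roughly, that could look like the sketch below, using the javax.xml.xpath and DOM classes that ship with the JDK. It assumes the concatenated XML has been wrapped in a single root element called root; the class name XPathEmails and the sample addresses are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration only.
final class XPathEmails {

    // Parses the XML string and returns the text of every /root/email element.
    static List<String> emails(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // "/root/email" assumes <email> sits directly under a <root> element.
            NodeList nodes = (NodeList) xpath.evaluate(
                    "/root/email", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // getTextContent returns the text inside the element.
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse or query XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root><somedata>blah</somedata>"
                   + "<email>a@example.com</email>"
                   + "<email>b@example.com</email></root>";
        System.out.println(emails(xml));
    }
}
```

If the email elements can appear at any depth, the query //email would match them regardless of the root element's name.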

Upvotes: 4

DaveJohnston

Reputation: 10151

Check out the org.xml.sax API. It is very easy to use and lets you parse through XML and do whatever you want with the contents whenever you come across anything of interest. So you could easily add some logic to look for <email> start elements and then save the contents (characters), which will contain your email address.
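As a rough sketch of that idea: the handler below starts buffering when it sees an <email> start element, collects the characters, and stores the result at the matching end element. The class name SaxEmails and the sample data are invented, and it assumes the input is well-formed XML with a single root element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class for illustration only.
final class SaxEmails {

    static List<String> emails(String xml) {
        final List<String> found = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            // Non-null only while the parser is inside an <email> element.
            private StringBuilder current;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                if ("email".equals(qName)) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per element, so append.
                if (current != null) {
                    current.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("email".equals(qName) && current != null) {
                    found.add(current.toString().trim());
                    current = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<root><somedata>blah</somedata>"
                   + "<email>a@example.com</email></root>";
        System.out.println(emails(xml));
    }
}
```

Because SAX streams through the input, this avoids building a full document tree in memory, which may matter if the combined string gets large.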

Upvotes: 0

Avi

Reputation: 20142

A regular expression that will find and return strings between two " characters:

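import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The class name is only illustrative; put the members wherever they fit.
public class QuoteFinder {
    // Non-greedy match of whatever sits between two double quotes.
    private static final Pattern PATTERN = Pattern.compile("\"(.*?)\"");

    private void doStuffWithStringsBetweenQuotes(String source) {
        Matcher matcher = PATTERN.matcher(source);
        while (matcher.find()) {
            // group(1) is the text between the quotes, without the quotes.
            String match = matcher.group(1);
            // ... do something with match here
        }
    }
}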
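The same pattern idea can be adapted to the delimiters from the question. The sketch below assumes the tags are literally <email> and </email> with no attributes; the class name RegexEmails and the sample addresses are made up. Regular expressions are fragile against real-world XML, so treat this as a quick fix at best.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
final class RegexEmails {

    // Non-greedy capture of everything between <email> and </email>.
    // DOTALL lets '.' match line breaks in case the content spans lines.
    private static final Pattern EMAIL_TAG =
            Pattern.compile("<email>(.*?)</email>", Pattern.DOTALL);

    static List<String> emails(String source) {
        List<String> result = new ArrayList<>();
        Matcher matcher = EMAIL_TAG.matcher(source);
        while (matcher.find()) {
            result.add(matcher.group(1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<somedata>blah</somedata><email>a@example.com</email>"
                   + "<somedata>blah</somedata><email>b@example.com</email>";
        System.out.println(emails(xml));
        // prints [a@example.com, b@example.com]
    }
}
```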

Upvotes: 2

nanda

Reputation: 24788

Have you tried using a regex? A sample document would be very useful for this kind of question.

Upvotes: 0

Noon Silk

Reputation: 55062

To answer the question in your title: use .indexOf or regular expressions.

But after a brief review of your question, you should really be processing the XML document properly.

Upvotes: 3
