Reputation:
What I am doing is validating URLs from my code. I have a file with URLs in it and I want to check whether each one exists. If a URL exists, the web page contains XML in which there will be an email address I want to extract. I go round a while loop and, on each iteration, if the URL exists, the XML is appended to a string, so I end up with one big string containing all the XML. What I want to do is extract the email addresses from this string. I don't think I can use the methods in the String API, as they require you to specify the starting index, which I don't know because it varies each time.
What I was hoping to do was search the string for a substring starting with (e.g.) "<email id>" and ending with (e.g.) "</email id>", and add the text between those two markers to a separate string.
Does anyone know if this is possible, or if there is an easier/different way of doing what I want to do?
Thanks.
Upvotes: 0
Views: 1439
Reputation: 1382
If I understand your question correctly, you are extracting pieces of XML from multiple web pages and concatenating them into one big 'xml' string, something that looks like:
"<somedata>blah</somedata>
<email>[email protected]</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
<email>[email protected]</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
<email>[email protected]</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
"
I'd advise making that a well-formed XML document by wrapping it in a root element (note that element names are case-sensitive, so the closing tag must match the opening one exactly):
"<?xml version="1.0" encoding="ISO-8859-1"?>
<newRoot>
<somedata>blah</somedata>
<email>[email protected]</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
<email>[email protected]</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
<email>[email protected]</email>
<somedata>blah</somedata>
<somedata>blah</somedata>
</newRoot>"
Then you could load that into an XML Document object and use XPath expressions to extract the email nodes and their values.
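As a rough sketch of that approach using the JDK's built-in javax.xml APIs: the class name EmailXPathSketch, the method name extractEmails, and the root element name newRoot are all made up for illustration. It assumes the concatenated fragments contain no XML declarations of their own and that the email elements end up as direct children of the root.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; the names are illustrative only.
final class EmailXPathSketch {

    // Wraps the concatenated fragments in a root element, parses the result,
    // and returns the text of every <email> element directly under the root.
    static List<String> extractEmails(String fragments) {
        try {
            // Assumption: the fragments carry no <?xml ...?> declarations.
            String xml = "<newRoot>" + fragments + "</newRoot>";
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    "/newRoot/email", doc, XPathConstants.NODESET);

            List<String> emails = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                emails.add(nodes.item(i).getTextContent().trim());
            }
            return emails;
        } catch (Exception e) {
            // Malformed input, parser configuration problems, and so on.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

Calling EmailXPathSketch.extractEmails(bigString) should return one entry per email element. If the elements can be nested deeper, an expression such as //email would match them at any depth.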
If you don't want to do that, you could use the indexOf(String str, int fromIndex) method to find the positions of <email> and </email> (or whatever they are called) and then take a substring based on those. That's not a particularly clean or easy-to-read way of doing it, though.
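For completeness, the indexOf route might look roughly like this. The class and method names (EmailIndexOfSketch, extractBetween) are invented for the example, and it assumes the opening tag appears exactly as given, with no attributes or extra whitespace inside it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the names are illustrative only.
final class EmailIndexOfSketch {

    // Returns every substring found between an occurrence of 'open'
    // and the next occurrence of 'close', scanning left to right.
    static List<String> extractBetween(String source, String open, String close) {
        List<String> results = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = source.indexOf(open, from);
            if (start < 0) {
                break; // no further opening marker
            }
            int contentStart = start + open.length();
            int end = source.indexOf(close, contentStart);
            if (end < 0) {
                break; // opening marker without a matching close
            }
            results.add(source.substring(contentStart, end));
            from = end + close.length(); // continue after this match
        }
        return results;
    }
}
```

A call such as EmailIndexOfSketch.extractBetween(bigString, "<email>", "</email>") walks the whole string, so the starting index never has to be known in advance: each search resumes where the previous match ended.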
Upvotes: 0
Reputation: 317
If you know the structure of the XML document well, I'd recommend using XPath.
For example, with emails contained in <email>[email protected]</email>, the XPath query would be something like /root/email (depending on your XML structure).
Executing this XPath query against your XML returns all the matching <email> elements as a list of nodes. Once you have an element, you can read its content, for example with getTextContent(). (getNodeValue() returns null for an element node itself, so call it on the element's text child if you use it.)
Upvotes: 4
Reputation: 10151
Check out the org.xml.sax API. It is easy to use and lets you parse through XML and do whatever you want with the contents whenever you come across anything of interest. So you could add some logic to look for <email> start elements and then save the contents (characters), which will contain your email address.
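A minimal sketch of such a handler, assuming the input is a single well-formed document and the element is literally named email; the class names EmailSaxSketch and EmailHandler are invented for illustration:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class; the names are illustrative only.
final class EmailSaxSketch {

    // Collects the character content of every <email> element.
    static final class EmailHandler extends DefaultHandler {
        final List<String> emails = new ArrayList<>();
        private StringBuilder current; // non-null only while inside <email>

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attributes) {
            if ("email".equals(qName)) {
                current = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times for one element, so append.
            if (current != null) {
                current.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("email".equals(qName) && current != null) {
                emails.add(current.toString().trim());
                current = null;
            }
        }
    }

    static List<String> extractEmails(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            EmailHandler handler = new EmailHandler();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.emails;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

Because SAX streams through the input rather than building a tree, this variant may suit very large strings better than loading everything into a Document.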
Upvotes: 0
Reputation: 20142
A regular expression that will find and return strings between two " characters:
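    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Wrapper class added so the snippet compiles; its name is arbitrary.
    public class QuoteFinder {
        // Reluctant (.*?) makes each match stop at the nearest closing quote.
        private static final Pattern PATTERN = Pattern.compile("\"(.*?)\"");

        private void doStuffWithStringsBetweenQuotes(String source) {
            Matcher matcher = PATTERN.matcher(source);
            while (matcher.find()) {
                // Group 1 is the text between the quotes, quotes excluded.
                String match = matcher.group(1);
                // ... do something with match here ...
            }
        }
    }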
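The same idea could be adapted to the tags in the question. This is only a sketch: the class name EmailRegexSketch and the literal <email> tag are assumptions, and a regex like this breaks as soon as the tag carries attributes or the XML is structured differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
final class EmailRegexSketch {

    // Reluctant match between literal tags; DOTALL lets the content span lines.
    private static final Pattern EMAIL_ELEMENT =
            Pattern.compile("<email>(.*?)</email>", Pattern.DOTALL);

    static List<String> extractEmails(String source) {
        List<String> emails = new ArrayList<>();
        Matcher matcher = EMAIL_ELEMENT.matcher(source);
        while (matcher.find()) {
            emails.add(matcher.group(1).trim()); // group 1 = text between the tags
        }
        return emails;
    }
}
```

EmailRegexSketch.extractEmails(bigString) collects every match in document order, without needing a root element or well-formed input.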
Upvotes: 2
Reputation: 24788
Have you tried using a regex? A sample document would be very useful for this kind of question.
Upvotes: 0
Reputation: 55062
To answer your subject question: .indexOf, or, regular expressions.
But after a brief review of your question, you should really be processing the XML document properly.
Upvotes: 3