Reputation: 551
I have the following issue with JSoup.
I want to parse and modify the following html code:
<code>
<style type="text/css" media="all">
@import url("http://hakkon-aetterni.at/modules/system/system.base.css?ll3lgd");
@import url("http://hakkon-aetterni.at/modules/system/system.menus.css?ll3lgd");
@import url("http://hakkon-aetterni.at/modules/system/system.messages.css?ll3lgd");
@import url("http://hakkon-aetterni.at/modules/system/system.theme.css?ll3lgd");
</style>
</code>
I'm using the following code to achieve that:
Elements cssImports = doc.select("style");
for (Element src : cssImports) {
    String regex = "url\\(\"(.)*\"\\)";
    String data = src.data();
    String link;
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(data);
    while (m.find()) {
        link = m.group().substring(5, m.group().length() - 2);
        doc = Jsoup.parse(doc.html().replace(link, ""));
    }
}
First, it works. All the import URLs are replaced with the string "FOUND". The issue I'm having is that I get a lot of new lines between the last import statement and the closing </style> tag, which were not there before.
Any clues why this is happening and how I can avoid it?
Sorry for the bad formatting, but it seems like some parts of my code are getting removed on posting. There is a style tag surrounding the first code block...
Upvotes: 1
Views: 1640
Reputation: 10843
Well, I landed on this page today looking to do a very similar thing, and I believe that I've solved it. Hopefully someone's still watching this now that it's a month later. ;)
What I found to work well was, instead of doing string replaces and re-parsing the document on every loop iteration, to rebuild the content of the style element. One of the places where JSoup really shines is in how easy its API makes editing a parsed document.
The other trick is to use the data() function. JSoup differentiates between data nodes (e.g. the contents of script and style elements) and HTML/text nodes. The main difference is that HTML escaping is not applied to data nodes.
Putting all this together, the following code snippet should replace your imported stylesheet refs with your FOUND text, but without changing the formatting of your document:
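import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.nodes.DataNode;
import org.jsoup.nodes.Element;

// compile the regex before entering the loop, as it's a relatively expensive operation
Pattern pattern = Pattern.compile("url\\(\"(.)*\"\\)");
for (Element styleElem : doc.getElementsByTag("style")) {
    String data = styleElem.data();
    StringBuffer newData = new StringBuffer();
    Matcher matcher = pattern.matcher(data);
    while (matcher.find()) {
        matcher.appendReplacement(newData, "FOUND");
    }
    matcher.appendTail(newData);
    // drop the old data node first; otherwise the rewritten CSS is appended after the original
    styleElem.empty();
    // base is assumed to be the java.net.URL the document was loaded from (declared elsewhere);
    // recent JSoup releases take only the data string: new DataNode(newData.toString())
    styleElem.appendChild(new DataNode(newData.toString(), base.toExternalForm()));
}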
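If it helps to see the appendReplacement/appendTail part in isolation, here is a minimal sketch that runs without JSoup. The class name StyleDataRewrite, the method replaceUrls, and the example.com URLs are made up for illustration; the pattern is the same one as in the snippet above. Everything the pattern does not match, including newlines, is copied through unchanged, which is why the formatting survives.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JSoup: rewrites the text of a style block.
class StyleDataRewrite {
    // Same pattern as above: matches a whole url("...") token on a single line.
    static final Pattern URL_PATTERN = Pattern.compile("url\\(\"(.)*\"\\)");

    static String replaceUrls(String data, String replacement) {
        StringBuffer out = new StringBuffer();
        Matcher m = URL_PATTERN.matcher(data);
        while (m.find()) {
            // quoteReplacement keeps '$' and '\' in the replacement literal
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        // copy whatever follows the last match, trailing newlines included
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String css = "\n@import url(\"http://example.com/a.css?x\");\n"
                   + "@import url(\"http://example.com/b.css?x\");\n";
        System.out.print(replaceUrls(css, "FOUND"));
    }
}
```

Note that this pattern swaps out the entire url("...") token, not just the link inside the quotes, so each line ends up as @import FOUND;.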
P.S. I'm assuming that you've turned pretty-printing off. Since your document parsing code isn't displayed, though, make doubly sure to call document.outputSettings().prettyPrint(false); after parsing.
P.P.S. In my own code, I'm using a more tolerant (and slightly uglier) regex to find the imports. It lets the user get away with omitting the url() wrapper, quotes, parens, etc., because HTML in the wild tends to do all of those things. I have it declared in my code as follows:
public static final Pattern CSS_IMPORT_PATTERN = Pattern.compile("(@import\\s+(?:url)?\\s*\\(?\\s*['\"]?)(.*?)([\\s'\";,)]|$)");
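Because that pattern captures the text before and after the URL in groups 1 and 3, one way to use it is to keep those groups and replace only group 2. This is just a sketch of how it could be applied, not necessarily how my own code does it; the class name CssImportRewriter, the method rewriteImports, and the sample inputs are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the tolerant pattern shown above.
class CssImportRewriter {
    static final Pattern CSS_IMPORT_PATTERN = Pattern.compile(
        "(@import\\s+(?:url)?\\s*\\(?\\s*['\"]?)(.*?)([\\s'\";,)]|$)");

    // Keeps the prefix (group 1) and terminator (group 3), swaps only the URL (group 2).
    static String rewriteImports(String css, String replacement) {
        Matcher m = CSS_IMPORT_PATTERN.matcher(css);
        return m.replaceAll("$1" + Matcher.quoteReplacement(replacement) + "$3");
    }

    public static void main(String[] args) {
        // quoted url(...) form: @import url("FOUND");
        System.out.println(rewriteImports("@import url(\"http://example.com/a.css?x\");", "FOUND"));
        // unquoted url(...) form: @import url(FOUND);
        System.out.println(rewriteImports("@import url(foo.css);", "FOUND"));
        // bare string form: @import 'FOUND';
        System.out.println(rewriteImports("@import 'foo.css';", "FOUND"));
    }
}
```

The result of rewriteImports can then be put back into the style element the same way as newData in the main snippet.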
Upvotes: 2