Reputation: 23
I can't for the life of me figure out how to accomplish this task with TextPipe.
TASK:
Extract (cut out) this TEXT including the start and end tag and get a file containing only these tags and the text in between.
<div><div class="article">`TEXT`<span id="contentBottomLeft"></span>
I defined a restriction filter with an end and start tag, but what's next? This filter demands a subfilter and I don't understand what exact filter I need to use next and how to customize it. I need to repeat this extraction process for several thousands of HTML files.
Steps specifically for TextPipe will be greatly appreciated, as I'm not much of a programmer myself.
Upvotes: 2
Views: 1158
Reputation: 51
This is pretty easy with TextPipe, which BTW is awesome.
Add a Perl pattern search and replace filter, with search text of:
<div><div class="article">[^<]*<span id="contentBottomLeft"></span>
Set the replace text to:
$0\r\n
Then, simply check the 'Extract matches' option of the search/replace filter.
Finally, in the Output Filter, use the 'Single File Output' option and set it to your destination filename.
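For anyone who wants to sanity-check the pattern outside TextPipe, here is a rough Python sketch of the same idea. It is not TextPipe's own mechanism, just the standard `re` module standing in for the search/replace filter with 'Extract matches' checked. The function names and the `input_dir` / `output_file` parameters are made up for illustration. Note that `[^<]*` only matches when TEXT contains no `<` character, so articles with nested tags inside them will be skipped by this exact pattern.

```python
import re
from pathlib import Path

# Same pattern as in the answer above. [^<]* stops at the first '<',
# so TEXT containing any nested tag will NOT match.
PATTERN = re.compile(
    r'<div><div class="article">[^<]*<span id="contentBottomLeft"></span>'
)


def extract_matches(html):
    """Return every start-tag ... end-tag block found in the given HTML string."""
    return PATTERN.findall(html)


def extract_to_single_file(input_dir, output_file):
    """Write all matches from every .html file in input_dir to one output file.

    input_dir and output_file are hypothetical paths chosen by the caller.
    Returns the number of matches written.
    """
    count = 0
    # newline="" keeps the explicit \r\n from being translated again.
    with open(output_file, "w", encoding="utf-8", newline="") as out:
        for path in sorted(Path(input_dir).glob("*.html")):
            html = path.read_text(encoding="utf-8", errors="replace")
            for match in extract_matches(html):
                out.write(match + "\r\n")  # mirrors the $0\r\n replace text
                count += 1
    return count
```

Calling `extract_to_single_file("pages", "extracted.txt")` would then gather the matches from every `.html` file in a folder named `pages` (both names are placeholders). If the article text can contain other tags, the pattern would need to be loosened, which is worth testing on a handful of files first.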
Upvotes: 5
Reputation: 126722
Without more detail from you, I can only guess that you want to remove all <div> elements whose first child is another <div> element with a class attribute equal to "article".
After a quick look at the TextPipe documentation, it looks like it won't do anything like XPath expressions, but you should experiment with a Restrict to between tags filter and a Remove All subfilter.
Bear in mind that it is possible that TextPipe won't do what you want and you may have to look elsewhere for a solution.
Upvotes: 2