Andrew

Reputation: 23

TextPipe: extracting text between two tags

I can't for the life of me figure out how to accomplish this task with TextPipe.

TASK:

Extract (cut out) this TEXT, including the start and end tags, and get a file containing only these tags and the text in between.

<div><div class="article">`TEXT`<span id="contentBottomLeft"></span>

I defined a restriction filter with a start and end tag, but what's next? This filter demands a subfilter, and I don't understand which filter I need to use next or how to customize it. I need to repeat this extraction process for several thousand HTML files.

Steps specifically for TextPipe will be greatly appreciated, as I'm not much of a programmer myself.

Upvotes: 2

Views: 1158

Answers (2)

Simon

Reputation: 51

This is pretty easy with TextPipe, which BTW is awesome.

Add a Perl search and replace pattern filter, with search text of:

<div><div class="article">[^<]*<span id="contentBottomLeft"></span>
  • Here, TEXT can be any run of characters except a '<'. This makes the pattern faster, but it also means the pattern will not match if TEXT itself contains other tags.

Set the replace text to:

$0\r\n

Then, simply check the 'Extract matches' option of the search/replace filter.

Finally, in the Output Filter, use the 'Single File Output' to your destination filename.
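For anyone who wants to check the pattern outside TextPipe, here is a rough Python sketch of the same idea: find every match, keep the start and end tags, and write the matches to one output file, each followed by \r\n. The function names, the assumption that the files are UTF-8 `*.html` files in a single folder, and the `LOOSE` variant (for TEXT that contains tags of its own) are all mine, not part of TextPipe.

```python
import re
from pathlib import Path

# Same pattern as the search text above: TEXT may not contain '<'.
STRICT = re.compile(
    r'<div><div class="article">[^<]*<span id="contentBottomLeft"></span>'
)

# Assumption: a looser variant for when TEXT itself contains tags.
# Non-greedy, and DOTALL so that TEXT may span several lines.
LOOSE = re.compile(
    r'<div><div class="article">.*?<span id="contentBottomLeft"></span>',
    re.DOTALL,
)


def extract_matches(html, pattern=STRICT):
    """Return every matched block, start and end tags included."""
    return pattern.findall(html)


def extract_to_single_file(src_dir, dest_file, pattern=STRICT):
    """Write the matches from every .html file in src_dir to one file."""
    # newline="" keeps the explicit \r\n from being translated.
    with open(dest_file, "w", encoding="utf-8", newline="") as out:
        for path in sorted(Path(src_dir).glob("*.html")):
            html = path.read_text(encoding="utf-8", errors="replace")
            for match in extract_matches(html, pattern):
                out.write(match + "\r\n")
```

Calling `extract_to_single_file("pages", "articles.txt")` would roughly correspond to the 'Extract matches' plus 'Single File Output' combination; pass `pattern=LOOSE` if the strict pattern finds nothing because the article text has markup inside it.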

Upvotes: 5

Borodin

Reputation: 126722

Without any further help from yourself, I can only guess that you want to remove all <div> elements whose first child is another <div> element with a class attribute equal to "article".

After a quick look at the TextPipe documentation, it looks like it won't do anything like XPath expressions, but you should experiment with a "Restrict to between tags" filter and a "Remove All" subfilter.

Bear in mind that it is possible that TextPipe won't do what you want and you may have to look elsewhere for a solution.

Upvotes: 2
