Extract text between specified html chunks in python

Question

I have the piece of HTML below and need to extract only the text between Current and Archive.

The HTML chunk looks like:

Current
File1


File2


File3


Archive
Some another file

So the desired output should look like File1, File2, File3.

This is what I've tried so far:

import re
m = re.compile('Current(.*?)Archive').search(text)

but it doesn't work as expected.

Is there a simple way to extract the text between specified chunks of HTML tags in Python?

Patrick Artner · Accepted Answer

If you insist on using regex, you can combine it with list slicing like so:

chunk="""Current
File1


File2


File3


Archive
Some another file"""

import re

# find everything between > and <, non-greedy so each match is as short as possible
found = re.findall(r">(.+?)<", chunk)

# keep only the items after "Current" and before "Archive"
found[:] = found[found.index("Current") + 1:found.index("Archive")]

print(found)  # Python 3 syntax, remove () for Python 2.7

Output:

['File1', 'File2', 'File3']
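
If regex is not a hard requirement, the same "collect the text, then slice between the two markers" idea can be sketched with the standard library's html.parser module. This is only a rough sketch under assumptions: the real markup is not known, so the <b>, <a> and <br> tags in the sample below are hypothetical, and TextCollector and text_between are names made up for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical markup: the real tags are not known, so <b>, <a> and <br>
# are assumptions made purely for illustration.
chunk = """<b>Current</b>
<a href="#">File1</a><br>
<a href="#">File2</a><br>
<a href="#">File3</a><br>
<b>Archive</b>
<a href="#">Some another file</a>"""


class TextCollector(HTMLParser):
    """Collect every non-blank text node in document order."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        # Skip whitespace-only nodes such as the newlines between tags.
        text = data.strip()
        if text:
            self.texts.append(text)


def text_between(html, start, end):
    """Return the text nodes found after `start` and before `end`."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    begin = parser.texts.index(start) + 1
    stop = parser.texts.index(end, begin)  # first `end` after `start`
    return parser.texts[begin:stop]


print(text_between(chunk, "Current", "Archive"))
```

With the sample markup above this prints ['File1', 'File2', 'File3']. Note that list.index raises ValueError if either marker is missing, so real input may need a try/except around the call.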
