vidhan

Reputation: 129

How to extract the text between HTML tags using regular expressions?

I need to extract the text between the opening and closing textarea tags.

How can I do this using regular expressions?

<textarea rows="20" cols="70" name="file" id="file" style="width: 100%;"data-input-file="1">
 abc_text
 #include<abc>
 xyz
</textarea>

Upvotes: 0

Views: 3275

Answers (2)

Vivek Sable

Reputation: 10213

The input is not valid XML: the opening and closing tags do not match.

#include<abc>

Here <abc> is read as an opening tag, not as text content.

Strict XML parsing libraries will not parse invalid input.


Modified input:

If you change #include<abc> to #include&lt;abc&gt;, then the following will work:

>>> import lxml.html as PARSER
>>> root = PARSER.fromstring(data)
>>> root.xpath("//textarea/text()")
['\n abc_text\n #include<abc>\n xyz\n']
>>> 
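The escaping step described above can be done programmatically. This is a minimal sketch, assuming Python 3 and the standard-library html module; the names raw_body and escaped_body are illustrative and not part of the original answer, and the textarea attributes are shortened for readability.

```python
import html

# Textarea contents from the question, including the problematic "<abc>".
raw_body = "\n abc_text\n #include<abc>\n xyz\n"

# Escape &, < and > so that "<abc>" is treated as text rather than a tag.
escaped_body = html.escape(raw_body, quote=False)

# Rebuild the markup with the escaped body (attributes shortened here).
data = '<textarea name="file" id="file">' + escaped_body + '</textarea>'
print(data)
```

A parser that decodes entities then returns the original text, including the literal #include<abc>.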

Using a regular expression:

>>> data
'<textarea rows="20" cols="70" name="file" id="file" style="width: 100%;"data-input-file="1">\n abc_text\n</textarea>'
>>> import re
>>> # The capture group returns only the text between the tags.
>>> re.findall('<textarea[^>]*>([^<]*)</textarea>', data)
['\n abc_text\n']
>>> 
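Note that a pattern built on [^<]* only matches when the textarea body contains no < character, which is why the sample above uses a shortened body. The sketch below, which assumes the original input from the question, compares that approach with a lazy match under re.DOTALL; the variable names are illustrative.

```python
import re

# Original input from the question, including "#include<abc>".
data = ('<textarea rows="20" cols="70" name="file" id="file" '
        'style="width: 100%;"data-input-file="1">\n'
        ' abc_text\n #include<abc>\n xyz\n</textarea>')

# [^<]* cannot cross the "<" in "<abc>", so nothing matches.
negated_class = re.findall(r'<textarea[^>]*>([^<]*)</textarea>', data)

# A lazy ".*?" with DOTALL stops at the first closing tag instead.
lazy_dotall = re.findall(r'<textarea[^>]*>(.*?)</textarea>', data, re.DOTALL)

print(negated_class)
print(lazy_dotall)
```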

Upvotes: 1

Qiang Jin

Reputation: 4467

You can try:

>>> import re
>>> # Lazy (.*?) stops at the first closing tag; DOTALL lets "." match newlines.
>>> print([x.strip() for x in re.findall('<textarea.*?>(.*?)</textarea>', content, re.DOTALL)])
['abc_text\n #include<abc>\n xyz']
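If the page may contain several textareas, the same idea can be wrapped in a small helper. This is a sketch under assumptions rather than a definitive implementation: it assumes Python 3, the name extract_textarea_text is hypothetical, and the pattern will still fail if an attribute value contains a > character.

```python
import html
import re

# Hypothetical helper, not taken from the answers above.
TEXTAREA_RE = re.compile(
    r"<textarea\b[^>]*>(.*?)</textarea>",
    re.DOTALL | re.IGNORECASE,
)

def extract_textarea_text(markup):
    """Return the stripped, entity-decoded body of every textarea."""
    return [html.unescape(body).strip() for body in TEXTAREA_RE.findall(markup)]

content = (
    '<textarea name="a">\n abc_text\n #include<abc>\n xyz\n</textarea>'
    '<textarea name="b">second &lt;b&gt;</textarea>'
)
print(extract_textarea_text(content))
```

The lazy group keeps each match inside its own textarea, and html.unescape turns entities such as &lt; back into the characters a browser would display.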

Upvotes: 3
