Reputation: 129
I need to extract the text between the opening and closing textarea tags.
How can I do this using regular expressions?
<textarea rows="20" cols="70" name="file" id="file" style="width: 100%;"data-input-file="1">
abc_text
#include<abc>
xyz
</textarea>
Upvotes: 0
Views: 3275
Reputation: 10213
The input is not valid XML: the opening and closing tags do not match.
In the line #include<abc>, the
<abc>
part is read as an opening tag, not as content.
XML parsing libraries will not parse invalid input.
Modified input:
If you change #include<abc>
to #include&lt;abc&gt; (that is, escape the angle brackets)
then the following will work:
>>> import lxml.html as PARSER
>>> root = PARSER.fromstring(data)
>>> root.xpath("//textarea/text()")
['\n abc_text\n #include<abc>\n xyz\n']
>>>
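The escaping that this approach relies on can be done with the standard library rather than by hand. This is a minimal sketch, assuming Python 3; `raw_content` and `page` are illustrative names that do not appear in the question.

```python
import html

# Text that should appear inside the textarea, including the literal "<abc>".
raw_content = "abc_text\n#include<abc>\nxyz"

# quote=False escapes only &, < and >, which is enough for element content.
escaped = html.escape(raw_content, quote=False)

# Hypothetical page fragment built from the escaped text.
page = '<textarea name="file" id="file">\n' + escaped + '\n</textarea>'

print(escaped)
# html.unescape reverses the escaping, so the original text is recoverable.
print(html.unescape(escaped) == raw_content)
```

With the content escaped this way, a parser sees `#include&lt;abc&gt;` as text and hands it back as `#include<abc>`.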
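Using a regular expression (this pattern matches only when the content contains no < character, and it returns the whole element, tags included):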
>>> data
'<textarea rows="20" cols="70" name="file" id="file" style="width: 100%;"data-input-file="1">\n abc_text\n</textarea>'
>>> import re
>>> re.findall('<textarea[^>]*>[^<]*</textarea>', data)
['<textarea rows="20" cols="70" name="file" id="file" style="width: 100%;"data-input-file="1">\n abc_text\n</textarea>']
>>>
Upvotes: 1
Reputation: 4467
You can try:
>>> print([x.strip() for x in re.findall('<textarea.*?>(.*?)</textarea>', content, re.DOTALL)])
['abc_text\n #include<abc>\n xyz']
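For anyone who wants to run this outside the interpreter, here is a self-contained sketch of the same idea, assuming Python 3. The names `sample`, `TEXTAREA_RE` and `textarea_contents` are made up for illustration. The capture group is lazy, so a page with several textareas yields one entry per element instead of one match spanning all of them.

```python
import re

# The markup from the question, with the raw "<abc>" left in place.
sample = (
    '<textarea rows="20" cols="70" name="file" id="file" '
    'style="width: 100%;"data-input-file="1">\n'
    'abc_text\n'
    '#include<abc>\n'
    'xyz\n'
    '</textarea>'
)

# DOTALL lets "." match newlines; the lazy "(.*?)" stops at the first
# closing tag; IGNORECASE also accepts <TEXTAREA>.
TEXTAREA_RE = re.compile(r'<textarea\b[^>]*>(.*?)</textarea>',
                         re.DOTALL | re.IGNORECASE)


def textarea_contents(markup):
    """Return the stripped text of every textarea element in markup."""
    return [body.strip() for body in TEXTAREA_RE.findall(markup)]


print(textarea_contents(sample))
# prints ['abc_text\n#include<abc>\nxyz']
```

This still assumes no attribute value contains a > character. For arbitrary pages an HTML parser is the safer choice.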
Upvotes: 3