Reputation: 513
I'm trying to retrieve all the tags that contain a 'name' attribute, and then work with both the whole tag and the name value. This is the test code I have:
sourceCode = '<dirtfields name="one" value="stuff">\n<gibberish name="two"\nwewt>'
namesGroup = re.findall('<.*name="(.*?)".*>', sourceCode, re.IGNORECASE | re.DOTALL)
for name in namesGroup:
    print name
Which outputs:
two
And the output I am looking for would be:
['<dirtfields name="one" value="stuff">', 'one']
['<gibberish name="two"\nwewt>', 'two']
EDIT: Found a way to do it, thanks to doublesharp for the cleaner way to get the 'name' value.
namesGroup = re.findall(r'(<.*?name="([^"]*)".*?>)', sourceCode, re.IGNORECASE | re.DOTALL)
Which will output:
('<dirtfields name="one" value="stuff">', 'one')
('<gibberish name="two"\nwewt>', 'two')
Upvotes: 3
Views: 8780
Reputation: 89547
This pattern allows escaped quotes inside the value and avoids lazy quantifiers for performance reasons. That is why it is a bit long, but it is more robust:
myreg = re.compile(r"""
< (?: [^n>]+ | \Bn | n(?!ame\s*=) )+ # beginning of the tag
# until the name attribute
name \s* = \s* ["']? # attribute until the value
( (?: [^\s\\"']+ | \\{2} | \\. )* ) # value
[^>]*> # end of the tag
""", re.X | re.I | re.S)
namesGroup = myreg.findall(sourceCode)
However, using BeautifulSoup 4 (bs4) is a good solution for your case.
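To see what the escaped-quote handling buys you, here is a minimal sketch that runs the same pattern against a made-up tag whose value contains \" and compares it with the simpler name="([^"]*)" pattern. The sample string and the variable names robust_names and simple_names are assumptions for illustration only, and the print calls use Python 3 syntax.

```python
import re

# Same verbose pattern as above, compiled with the same flags.
myreg = re.compile(r"""
    < (?: [^n>]+ | \Bn | n(?!ame\s*=) )+   # beginning of the tag
                                           # until the name attribute
    name \s* = \s* ["']?                   # attribute until the value
    ( (?: [^\s\\"']+ | \\{2} | \\. )* )    # value
    [^>]*>                                 # end of the tag
""", re.X | re.I | re.S)

# Hypothetical input: the attribute value contains an escaped double quote.
escaped_source = r'<a name="o\"ne">'

robust_names = myreg.findall(escaped_source)
simple_names = re.findall(r'name="([^"]*)"', escaped_source, re.I)

print(robust_names)  # the value is kept whole, escaped quote included
print(simple_names)  # the value is cut off at the escaped quote
```

Note that the value group excludes whitespace, so a value containing spaces would be truncated at the first space.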
Upvotes: 0
Reputation: 19628
Clearly you are dealing with an HTML or XML file and looking for the values of a specific attribute.
You are heading in the wrong direction if you keep working with regular expressions instead of a proper parser, such as BeautifulSoup4, the one I like the most. Here is a very brief example of how to use it:
from bs4 import BeautifulSoup
sourceCode = '<dirtfields name="one" value="stuff">\n<gibberish name="two"\nwewt>'
soup = BeautifulSoup(sourceCode)
print soup.prettify()
print '------------------------'
for tag in soup.find_all():
    if tag.has_attr('name'):
        print tag, tag['name']
The output looks a bit ugly (and is even wrong, because the unclosed tags end up nested), but it shows how BeautifulSoup automatically repairs your broken HTML and makes it easy to locate the attribute you want.
<html>
<body>
<dirtfields name="one" value="stuff">
<gibberish name="two" wewt="">
</gibberish>
</dirtfields>
</body>
</html>
------------------------
<dirtfields name="one" value="stuff">
<gibberish name="two" wewt=""></gibberish></dirtfields> one
<gibberish name="two" wewt=""></gibberish> two
Add BeautifulSoup to your favorite Stack Overflow tags and you will be surprised how good it is and how many people are doing the same thing as you with a more powerful tool!
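If installing a third-party package is not an option, the same parser-based idea can be sketched with the standard library's html.parser module instead of BeautifulSoup (a swapped-in technique, not what the example above uses). The class name NameTagCollector and its found attribute are made up for this sketch, and it assumes Python 3 (in Python 2 the module is named HTMLParser). It should give the (whole tag, name) pairs the question asks for, with the tag text left exactly as it appears in the input.

```python
from html.parser import HTMLParser


class NameTagCollector(HTMLParser):
    """Collect (raw start tag, name value) pairs for tags with a name attribute."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (attribute, value) pairs; attribute names are lowercased.
        attr_map = dict(attrs)
        if 'name' in attr_map:
            # get_starttag_text() returns the start tag as it was written in the input.
            self.found.append((self.get_starttag_text(), attr_map['name']))


source_code = '<dirtfields name="one" value="stuff">\n<gibberish name="two"\nwewt>'

collector = NameTagCollector()
collector.feed(source_code)
collector.close()

for raw_tag, name in collector.found:
    print([raw_tag, name])
```

Unlike the BeautifulSoup output above, nothing is repaired or nested here: the parser only reports each start tag as it encounters it.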
Upvotes: 2
Reputation: 27599
Your regex is a bit off: you are matching too much (all the way to the last >). Since you just need the values between the double quotes after name=, use the following pattern:
name="([^"]*)"
name=" matches the first part of the attribute you are looking for.
([^"]*) creates a capture group that matches any characters that are not a double quote.
" matches the double quote after the name attribute value.
And your code would look like this (it's good form to put an r prefix before your pattern, making it a raw string):
namesGroup = re.findall(r'name="([^"]*)"', sourceCode, re.IGNORECASE)
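As a quick check of the "matching too much" point, this small sketch runs the question's original pattern and the pattern above side by side on the same test string. The variable names greedy_names and fixed_names are just for illustration, and print is written in Python 3 style.

```python
import re

source_code = '<dirtfields name="one" value="stuff">\n<gibberish name="two"\nwewt>'

# Original pattern: with DOTALL, the greedy '<.*' runs to the end of the string
# and backtracks only as far as the LAST name="...", so a single match is found.
greedy_names = re.findall(r'<.*name="(.*?)".*>', source_code,
                          re.IGNORECASE | re.DOTALL)

# Suggested pattern: each name="..." is matched on its own.
fixed_names = re.findall(r'name="([^"]*)"', source_code, re.IGNORECASE)

print(greedy_names)  # ['two']
print(fixed_names)   # ['one', 'two']
```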
Upvotes: 4
Reputation: 4795
(?<=name=")[^"]*
If you wanted to match only the name without having a capture group, you could use:
re.findall(r'(?<=name=")[^"]*', sourceCode, re.IGNORECASE)
Output: ['one', 'two']
Of course capture groups are an equally acceptable solution.
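To make the difference between the two approaches concrete, here is a small sketch using finditer on the question's test string. The names lookbehind_re and capture_re are assumptions for this example, and it is written for Python 3.

```python
import re

source_code = '<dirtfields name="one" value="stuff">\n<gibberish name="two"\nwewt>'

lookbehind_re = re.compile(r'(?<=name=")[^"]*', re.IGNORECASE)
capture_re = re.compile(r'name="([^"]*)"', re.IGNORECASE)

# With the lookbehind, the whole match (group 0) is already the bare value.
whole_matches = [m.group(0) for m in lookbehind_re.finditer(source_code)]

# With the capture group, group 0 still includes name="...", so group 1 is needed.
group_zero = [m.group(0) for m in capture_re.finditer(source_code)]
group_one = [m.group(1) for m in capture_re.finditer(source_code)]

print(whole_matches)  # ['one', 'two']
print(group_zero)     # ['name="one"', 'name="two"']
print(group_one)      # ['one', 'two']
```

One limitation to keep in mind: Python's re module only accepts fixed-width lookbehinds, so this approach cannot allow optional whitespace around the = sign the way a capture-group pattern can.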
Upvotes: 2