Reputation: 13
I'm new to Python and I'm trying to extract a value from a string, but it doesn't work. My string looks something like this:
<a href="/profile/view?id=34232962&goback=%2Enmp_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1&trk=spm_pic" title="View your profile">
My attempt is:
m = re.search('^.*\b(view|your|profile)\b.*$', newp, re.IGNORECASE)
print m.group(0)
The desired output:
/profile/view?id=34232962&goback=%2Enmp_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1 trk=spm_pic
Upvotes: 1
Views: 91
Reputation: 10003
You need to know that * is greedy, i.e. it will try to match as many characters as possible. So in your example (if matching the href only):
'^.*\b(view|your|profile)\b.*$'
.../profile/view?id=34232962&goback=%2Enmp_*1_*1_*...
^----------^ matched by '.*'
            ^ \b
            ^--^ matched by 'view'
                ^ \b
                ^------... matched by '.*'
If matching full string:
... title="View your profile">
^-------------------^ matched by '.*'
                     ^ \b
                     ^-----^ matched by 'profile'
                            ^ \b
                            ^^ matched by '.*'
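The two traces above can be checked directly. This is a minimal sketch, assuming the tag from the question is held in a variable (called `tag` here, with `href_only` as a hypothetical cut-down version containing only the href value). The pattern is written as a raw string so that `\b` means a word boundary rather than a backspace character:

```python
import re

# The tag from the question; `href_only` is a hypothetical cut-down version.
href_only = '/profile/view?id=34232962&goback=%2Enmp_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1&trk=spm_pic'
tag = '<a href="' + href_only + '" title="View your profile">'

# Raw string: in a plain '...' literal, \b would be a backspace character.
pattern = r'^.*\b(view|your|profile)\b.*$'

m_href = re.search(pattern, href_only, re.IGNORECASE)
m_full = re.search(pattern, tag, re.IGNORECASE)

# The greedy leading .* pushes the captured word as far right as possible.
print(m_href.group(1))  # view
print(m_full.group(1))  # profile

# group(0) is the whole input, because ^.* and .*$ swallow everything else.
print(m_full.group(0) == tag)  # True
```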
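Another note: with re.search on a single-line string, ^.*<regex>.*$ is effectively the same as just <regex>. It matches in the same cases; the only difference is that group(0) grows to cover the whole string.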
What you probably want is: href="([^"]*)" - this will match whatever is in href="..." and capture it in group 1.
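A short sketch of how that pattern could be used, assuming `newp` holds the tag from the question. The `if m:` check is there because `re.search` returns `None` when nothing matches:

```python
import re

# `newp` is assumed to hold the tag shown in the question.
newp = '<a href="/profile/view?id=34232962&goback=%2Enmp_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1&trk=spm_pic" title="View your profile">'

# [^"]* matches everything up to (but not including) the closing quote.
m = re.search(r'href="([^"]*)"', newp)
if m:
    href = m.group(1)  # group(1) is the captured part, without href="..."
    print(href)
```

Note that `m.group(0)` would still include the surrounding `href="..."`; only `group(1)` is the bare value.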
Upvotes: 0
Reputation: 2430
Regex is horrible for parsing HTML, as you have found out. Use a tool built for the job. In the case of Python, use BeautifulSoup:
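from bs4 import BeautifulSoup  # pip install beautifulsoup4

# html_doc is the HTML text; name the parser explicitly to avoid a warning
soup = BeautifulSoup(html_doc, "html.parser")
# Find the first tag whose title attribute is "View your profile"
profile_a = soup.find(title="View your profile")
link = profile_a['href']
print(link)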
>> /profile/view?id=34232962&goback=%2Enmp_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1&trk=spm_pic
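If installing BeautifulSoup is not an option, the same idea (let a parser find the tag, then read its attribute) can be sketched with the standard library's `html.parser` module. This is a swapped-in alternative, not the BeautifulSoup API. The class name `HrefByTitle` is made up for this example, `html_doc` is assumed to hold the tag from the question, and the code is written for Python 3:

```python
from html.parser import HTMLParser


class HrefByTitle(HTMLParser):
    """Hypothetical helper: remember the href of the first <a> with a given title."""

    def __init__(self, title):
        super().__init__()
        self.title = title
        self.href = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        attrs = dict(attrs)
        if tag == "a" and self.href is None and attrs.get("title") == self.title:
            self.href = attrs.get("href")


# html_doc is assumed to be the tag from the question.
html_doc = '<a href="/profile/view?id=34232962&goback=%2Enmp_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1&trk=spm_pic" title="View your profile">'

parser = HrefByTitle("View your profile")
parser.feed(html_doc)
parser.close()
print(parser.href)  # None if no matching tag was found
```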
Upvotes: 3
Reputation: 10961
Ah? You want to crawl LinkedIn private pages? ;)
Something like this should work:
m = re.search('href="(/profile/[^"]+)"', newp, re.IGNORECASE)
But, as usual, don't use regular expressions to parse HTML.
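For a quick one-off, the pattern above could be used roughly as follows. This is only a sketch: the first link in `newp` is invented to show that the `/profile/` prefix skips unrelated hrefs, and `profile_link` is a name chosen for this example:

```python
import re

# Hypothetical fragment: the first link is invented for illustration,
# the second is the tag from the question.
newp = (
    '<a href="/home?trk=nav">Home</a> '
    '<a href="/profile/view?id=34232962&goback=%2Enmp_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1&trk=spm_pic" '
    'title="View your profile">'
)

m = re.search(r'href="(/profile/[^"]+)"', newp, re.IGNORECASE)
# re.search returns None when nothing matches, so guard before calling group()
profile_link = m.group(1) if m else None
print(profile_link)
```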
Upvotes: 0