Afshin

Reputation: 13

Regular Expression in python, a particular case

I'm a newbie in Python and I'm trying to extract a value from a string, but it doesn't work. My string is something like:

<a href="/profile/view?id=34232962&goback=%2Enmp_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1&trk=spm_pic" title="View your profile">

My attempt is:

m = re.search('^.*\b(view|your|profile)\b.*$', newp, re.IGNORECASE)
print m.group(0)

The desired output:

/profile/view?id=34232962&goback=%2Enmp_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1 trk=spm_pic

Upvotes: 1

Views: 91

Answers (3)

mishik

Reputation: 10003

You need to know that * is greedy, i.e. it tries to match as many characters as possible. So in your example (if matching only the href value):

'^.*\b(view|your|profile)\b.*$'

.../profile/view?id=34232962&goback=%2Enmp_*1_*1_*...
^---------^ matched by '.*'
           ^ \b
            ^--^ matched by 'view'
                ^ \b
                 ^------... - matched by .*

If matching full string:

... title="View your profile">
^------------------^ - .*
                    ^ \b
                     ^-----^ - 'profile'
                            ^ \b
                             ^ - .*

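Another note: with re.search on a single-line string, ^.*<regex>.*$ succeeds in effectively the same cases as just <regex>. Only the text returned by group(0), and which occurrence the group captures, differ.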
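Both points can be checked on a sample string. The sketch below assumes newp holds the anchor tag from the question, retyped here without any invisible characters. It also uses raw strings for the working patterns: in an ordinary Python string literal, '\b' is a backspace character rather than a word boundary, which is one likely reason the original attempt returned no match at all.

```python
import re

# Assumed sample input: the anchor tag from the question.
newp = ('<a href="/profile/view?id=34232962&goback=%2Enmp_*1_*1_*1_*1_*1'
        '_*1_*1_*1_*1_*1&trk=spm_pic" title="View your profile">')

# Without the r prefix, '\b' is a backspace character, so nothing matches.
broken = re.search('^.*\b(view|your|profile)\b.*$', newp, re.IGNORECASE)
print(broken)  # None

# Greedy: the leading .* grabs as much as it can, so the group ends up on
# the LAST keyword in the string, and group(0) is the whole input.
greedy = re.search(r'^.*\b(view|your|profile)\b.*$', newp, re.IGNORECASE)
print(greedy.group(0) == newp)                    # True
print(greedy.start(1) == newp.rindex('profile'))  # True

# The bare pattern also matches, but stops at the FIRST keyword it finds.
bare = re.search(r'\b(view|your|profile)\b', newp, re.IGNORECASE)
print(bare.group(0))                              # profile
print(bare.start(1) == newp.index('profile'))     # True
```

Either way, neither pattern isolates the link itself, which is why a pattern aimed at the href attribute is the better fit.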

What you probably want is: href="([^"]*)" - this will match whatever is in href="..."
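A minimal sketch of that suggestion, assuming the tag is stored in a variable named newp as in the question and that the attribute value is wrapped in double quotes (single-quoted or unquoted attributes would need a different pattern):

```python
import re

# Assumed sample input: the anchor tag from the question.
newp = ('<a href="/profile/view?id=34232962&goback=%2Enmp_*1_*1_*1_*1_*1'
        '_*1_*1_*1_*1_*1&trk=spm_pic" title="View your profile">')

# [^"]* matches everything up to (but not including) the closing quote,
# so greediness cannot run past the end of the attribute.
m = re.search(r'href="([^"]*)"', newp)

# re.search returns None when nothing matches, so check before using it.
link = m.group(1) if m else None
print(link)
```

The captured value is in group(1); group(0) would still include the surrounding href="...".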

Upvotes: 0

Michael Davis

Reputation: 2430

Regex is horrible for parsing HTML, as you have found out. Use a tool built for the job. In the case of Python, use BeautifulSoup.

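from bs4 import BeautifulSoup

# html_doc holds the page source as a string
soup = BeautifulSoup(html_doc, "html.parser")
# find the first tag whose title attribute matches
profile_a = soup.find(title="View your profile")
link = profile_a['href']
print(link)
>> /profile/view?id=34232962&goback=%2Enmp_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1&trk=spm_pic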
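If installing a third-party package is not an option, the same idea can be sketched with the standard library's html.parser module instead of BeautifulSoup. This is a swapped-in alternative, not what the answer above uses; it assumes Python 3, and the class name HrefByTitle and the sample html_doc are hypothetical, made up for illustration.

```python
from html.parser import HTMLParser


class HrefByTitle(HTMLParser):
    """Hypothetical helper: collect the href of every <a> with a given title."""

    def __init__(self, title):
        super().__init__()
        self.title = title
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current start tag.
        if tag != "a":
            return
        attributes = dict(attrs)
        if attributes.get("title") == self.title and "href" in attributes:
            self.links.append(attributes["href"])


# Assumed sample input: the anchor tag from the question.
html_doc = ('<a href="/profile/view?id=34232962&goback=%2Enmp_*1_*1_*1_*1_*1'
            '_*1_*1_*1_*1_*1&trk=spm_pic" title="View your profile">')

parser = HrefByTitle("View your profile")
parser.feed(html_doc)
print(parser.links)
```

The parser handles attribute quoting and ordering for you, so it keeps working if the title attribute comes before href or the tag spans several lines, cases where a hand-written regex tends to break.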

Upvotes: 3

Guillaume

Reputation: 10961

Ah? You want to crawl LinkedIn private pages? ;)

Something like this should work (the link is then in m.group(1)):

m = re.search('href="(/profile/[^"]+)"', newp, re.IGNORECASE)

But, as usual, don't use regular expressions to parse HTML.

Upvotes: 0
