Ivy Lin

Reputation: 123

BeautifulSoup: extract part of a string

I am new to Beautiful Soup 4 and find it really convenient. However, I ran into a problem when I needed to extract only part of a string.

For example, I have this link:

 <a href="http://nihao-wobuhao?%93%23%24%12&sort=102">NIHAO</a>

I can get the element with soap.findChildren('a'), but what if I only need the 'sort=102' part?

I tried soap.find_all(re.compile(^sort=.*?)), but it does not work. Can anyone help me with that? Thanks in advance!

Upvotes: 1

Views: 118

Answers (2)

alecxe

Reputation: 473863

To elaborate a little on @Don's answer:

  • locate the a element by, for example, text
  • get the href attribute value using a dictionary-like access to Tag's attributes
  • use urlparse.parse_qs() to get the url query parameters

Working sample:

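>>> from bs4 import BeautifulSoup
>>> from urlparse import urlparse, parse_qs
>>>
>>> html = '<a href="http://nihao-wobuhao?%93%23%24%12&sort=102">NIHAO</a>'
>>> # Parse the markup first; the lookup below needs a soup object
>>> soup = BeautifulSoup(html, "html.parser")
>>> # Find the link by its text, read href, then parse its query string
>>> parse_qs(urlparse(soup.find("a", text="NIHAO")['href']).query)['sort'][0]
u'102'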

Note that in Python 3, you would need to change the urlparse import to:

>>> from urllib.parse import urlparse, parse_qs
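For completeness, here is a rough Python 3 sketch of the same steps as a script rather than a REPL session. The helper name query_param and the choice of the "html.parser" backend are my own assumptions for illustration, not anything the question requires; it also loops over every link with an href instead of matching on the link text.

```python
from urllib.parse import urlparse, parse_qs

from bs4 import BeautifulSoup

html = '<a href="http://nihao-wobuhao?%93%23%24%12&sort=102">NIHAO</a>'
soup = BeautifulSoup(html, "html.parser")


def query_param(href, name, default=None):
    """Hypothetical helper: first value of query parameter `name`, or `default`."""
    values = parse_qs(urlparse(href).query).get(name)
    return values[0] if values else default


# Collect the "sort" value from every <a> tag that has an href attribute.
sort_values = [query_param(a["href"], "sort") for a in soup.find_all("a", href=True)]
print(sort_values)  # ['102']
```

Using .get() plus a default means a link without a sort parameter yields None instead of raising a KeyError, which is usually what you want when scanning a whole page.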

Upvotes: 0

Don Kirkby

Reputation: 56640

The urlparse module will pick out the pieces of a URL. You could use that to get the query parameter you're looking for.
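A minimal sketch of what that looks like, assuming Python 3 (where the module lives at urllib.parse) and using the URL from the question as a plain string:

```python
from urllib.parse import urlparse, parse_qs

url = "http://nihao-wobuhao?%93%23%24%12&sort=102"
parts = urlparse(url)

print(parts.scheme)  # http
print(parts.netloc)  # nihao-wobuhao
print(parts.query)   # %93%23%24%12&sort=102

# parse_qs maps each parameter name to a list of its values;
# the first chunk has no "=" and is skipped by default.
params = parse_qs(parts.query)
print(params["sort"][0])  # 102
```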

Upvotes: 1
