Beautiful Soup / regex: how to extract a particular part of the href in an <a> tag

Hey, I am using Beautiful Soup to build a scraper that extracts the id of an app searched for on the Play Store. The code:

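import requests
from bs4 import BeautifulSoup

def linkgen(name):
    # Fetch the Play Store search page for the given app name and parse it.
    base = "https://play.google.com/store/search?q="
    req = requests.get(base + name)
    soup = BeautifulSoup(req.content, "html.parser")
    # First element whose class attribute is exactly "Si6A0c Gy4nib".
    soup2 = soup.find(class_="Si6A0c Gy4nib")
    print(soup2)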

The output generated:

<a class="Si6A0c Gy4nib" href="/store/apps/details?id=com.facebook.katana" jslog="38003; 1:575|CBSqARUKEwjwyfy+1fj6AhXGZI4KHfF0AA8=; track:click,impression"><div class="Shbxxd"><img alt="Screenshot image" aria-hidden="true" class="T75of jpDEN" loading="lazy" src="https://play-lh.googleusercontent.com/9s-9zONYk4NZvLlHVMIF5cGCzrx7PjZYQ3uow5P8Rj2Mt_XHWygV3gOt75_iI1YtTg=w416-h235" srcset="https://play-lh.googleusercontent.com/9s-9zONYk4NZvLlHVMIF5cGCzrx7PjZYQ3uow5P8Rj2Mt_XHWygV3gOt75_iI1YtTg=w832-h470 2x"/></div><div class="j2FCNc"><img alt="Thumbnail image" aria-hidden="true" class="T75of stzEZd" loading="lazy" src="https://play-lh.googleusercontent.com/ccWDU4A7fX1R24v-vvT480ySh26AYp97g1VrIB_FIdjRcuQB2JP2WdY7h_wVVAeSpg=s64" srcset="https://play-lh.googleusercontent.com/ccWDU4A7fX1R24v-vvT480ySh26AYp97g1VrIB_FIdjRcuQB2JP2WdY7h_wVVAeSpg=s128 2x"/><div class="cXFu1"><div class="ubGTjb"><span class="DdYX5">Facebook</span></div><div class="ubGTjb"><span class="wMUdtb">Meta Platforms, Inc.</span></div><div class="ubGTjb"><div aria-label="Rated 3.2 stars out of five stars" style="display: inline-flex; align-items: center;"><span class="w2kbF">3.2</span><span class="Q4fJQd"><i aria-hidden="true" class="google-material-icons Yvy3Fd">star</i></span></div></div></div></div></a>

From this output I want to extract the id in the href link (in this case, "com.facebook.katana"). I have tried looking up the href on the <a> tag and also tried regex, but couldn't get any output. Any suggestions?
Thank you

Upvotes: 0

Views: 61

Answers (1)

jontec

Reputation: 111

To get only the id from the href, you can try this regex in your Python code:

r"(?<=id=)(.*?)(\")"

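The full match includes the closing quote, so either remove the last character of the matched string or take the first capture group, which holds the id alone. If you want to try the regex, just go here :)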
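As a rough sketch of how that regex could be applied, here is a small example that runs it against the tag's HTML text. The helper name `extract_app_id` and the shortened `sample` string are made up for illustration; in the question's code you would pass `str(soup2)` instead.

```python
import re

# Pattern from the answer: a lookbehind for "id=", then a lazy capture
# that stops at the next double quote.
ID_PATTERN = re.compile(r"(?<=id=)(.*?)(\")")


def extract_app_id(html):
    """Return the text after the first 'id=' up to the closing quote, or None.

    Hypothetical helper, not part of Beautiful Soup or the original code.
    """
    match = ID_PATTERN.search(html)
    if match is None:
        return None
    # match.group(0) would be the id plus the trailing quote;
    # group(1) is the id alone, so no trimming is needed.
    return match.group(1)


# Shortened version of the <a> tag from the question, for illustration only.
sample = '<a class="Si6A0c Gy4nib" href="/store/apps/details?id=com.facebook.katana" jslog="38003">'
print(extract_app_id(sample))
```

Note that the pattern relies on the closing double quote, so it matches the tag's HTML text (`str(soup2)`) but would not match the bare attribute value `soup2["href"]`, which has no quote after the id.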

Hope this helps! Have a nice day.

Upvotes: 2
