wjie08

Reputation: 445

Extract part of URL from column of URLs in python

I have a column of URLs and would like to extract the digits that come after "/show/" and before the next "/", and store them as integers.

sn    URL
1     https://tvseries.net/show/51/johnny155
2     https://tvseries.net/show/213/kimble2
3     https://tvseries.net/show/46/forceps
4     https://tvseries.net/show/90/tr9
5     https://tvseries.net/show/22/candlenut

expected output is

sn    URL                                          show_id
1     https://tvseries.net/show/51/johnny155       51
2     https://tvseries.net/show/213/kimble2        213
3     https://tvseries.net/show/46/forceps         46 
4     https://tvseries.net/show/90/tr9             90
5     https://tvseries.net/show/22/candlenut       22

So far I have tried the code below to retrieve the digits after "show". It produces a column in which each show_id is wrapped in a list (i.e., [51], [213]), and the column's type is pandas.core.series.Series.

Is there a more efficient way to get the show_id as an integer, without the brackets? I'd appreciate any help, thank you.

import urllib.parse as urlparse

df['protocol'],df['domain'],df['path'], df['query'], df['fragment'] = zip(*df['URL'].map(urlparse.urlsplit))

df['UID'] = df['path'].str.findall(r'(?<=show)[^,.\d\n]+?(\d+)')

Upvotes: 1

Views: 303

Answers (1)

Nick

Reputation: 147286

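You can use str.extract with a capture group to match the digits between the forward slashes after show. extract returns strings, so convert the result with astype(int) to get integers:

import pandas as pd

df = pd.DataFrame({ 'sn' : [1, 2, 3, 4, 5], 
                   'URL': ['https://tvseries.net/show/51/johnny155',
                           'https://tvseries.net/show/213/kimble2',
                           'https://tvseries.net/show/46/forceps',
                           'https://tvseries.net/show/90/tr9',
                           'https://tvseries.net/show/22/candlenut'
                           ]})
# raw string avoids an invalid-escape warning for \d;
# expand=False returns a Series instead of a one-column DataFrame
df['show_id'] = df['URL'].str.extract(r'show/(\d+)/', expand=False).astype(int)
df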

Output

   sn                                     URL show_id
0   1  https://tvseries.net/show/51/johnny155      51
1   2   https://tvseries.net/show/213/kimble2     213
2   3    https://tvseries.net/show/46/forceps      46
3   4        https://tvseries.net/show/90/tr9      90
4   5  https://tvseries.net/show/22/candlenut      22
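
Since the question already splits each URL with urlsplit, the same idea can also be written as a plain function that looks only at the path. This is a minimal sketch, not part of the original answer: the helper name show_id_from_url and the choice to return None for URLs without a show id are assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit

# Match "/show/<digits>" at the start of the path, followed by "/" or the end.
_SHOW_ID = re.compile(r'^/show/(\d+)(?:/|$)')

def show_id_from_url(url):
    """Return the id after /show/ as an int, or None if the URL has none.

    Hypothetical helper: the name and the None fallback are illustrative.
    """
    match = _SHOW_ID.match(urlsplit(url).path)
    return int(match.group(1)) if match else None

print(show_id_from_url('https://tvseries.net/show/51/johnny155'))
```

It could then be applied to the column with df['URL'].map(show_id_from_url). Matching against the path only means digits in the query string or fragment are never picked up by mistake.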

Upvotes: 2
