Reputation: 445
I have a column of URLs and would like to retrieve the digits that come after "/show/" and before the next "/". I would like these digits to be stored as integers.
sn URL
1 https://tvseries.net/show/51/johnny155
2 https://tvseries.net/show/213/kimble2
3 https://tvseries.net/show/46/forceps
4 https://tvseries.net/show/90/tr9
5 https://tvseries.net/show/22/candlenut
The expected output is:
sn URL show_id
1 https://tvseries.net/show/51/johnny155 51
2 https://tvseries.net/show/213/kimble2 213
3 https://tvseries.net/show/46/forceps 46
4 https://tvseries.net/show/90/tr9 90
5 https://tvseries.net/show/22/candlenut 22
So far I've tried the following code to retrieve the digits after "show". It produces a column where each show_id is in brackets (i.e., [51], [213]), and the column's type is pandas.core.series.Series.
Is there a more efficient way to get the show_id as an integer, without the brackets? I'd appreciate any help, thank you.
import urllib.parse as urlparse
df['protocol'],df['domain'],df['path'], df['query'], df['fragment'] = zip(*df['URL'].map(urlparse.urlsplit))
df['UID'] = df['path'].str.findall(r'(?<=show)[^,.\d\n]+?(\d+)')
Upvotes: 1
Views: 303
Reputation: 147286
You can use str.extract to create the column, with a capture group that matches the digits between the forward slashes after show:
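import pandas as pd

df = pd.DataFrame({'sn': [1, 2, 3, 4, 5],
                   'URL': ['https://tvseries.net/show/51/johnny155',
                           'https://tvseries.net/show/213/kimble2',
                           'https://tvseries.net/show/46/forceps',
                           'https://tvseries.net/show/90/tr9',
                           'https://tvseries.net/show/22/candlenut']})
# Raw string keeps \d intact; expand=False returns a Series for the single capture group.
# str.extract yields strings, so cast to int to get integer ids.
df['show_id'] = df['URL'].str.extract(r'show/(\d+)/', expand=False).astype(int)
df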
Output
sn URL show_id
0 1 https://tvseries.net/show/51/johnny155 51
1 2 https://tvseries.net/show/213/kimble2 213
2 3 https://tvseries.net/show/46/forceps 46
3 4 https://tvseries.net/show/90/tr9 90
4 5 https://tvseries.net/show/22/candlenut 22
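As a side note on the brackets in the question: str.findall returns a list of matches for every row, whereas str.extract returns the first match itself, as a string (or NaN where nothing matches). Converting to integers is therefore a separate step. The following is only a minimal sketch of that difference; the third URL is a hypothetical row with no show id, added to illustrate the missing-value case, and the nullable Int64 dtype is one possible choice for it rather than the only one.

```python
import pandas as pd

urls = pd.Series([
    'https://tvseries.net/show/51/johnny155',
    'https://tvseries.net/show/213/kimble2',
    'https://tvseries.net/about',  # hypothetical URL without a show id
])

# findall gives one list per row, which is what shows up as [51], [213]
as_lists = urls.str.findall(r'show/(\d+)/')

# extract gives the captured text directly; rows without a match become NaN
raw = urls.str.extract(r'show/(\d+)/', expand=False)

# to_numeric turns the strings into numbers (NaN stays NaN);
# the nullable Int64 dtype keeps real integers alongside missing values
show_id = pd.to_numeric(raw).astype('Int64')
```

If every URL is known to contain a show id, a plain .astype(int) on the extracted column would be enough.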
Upvotes: 2