BS100

Reputation: 873

How do I parse a URL after searching?

I'm trying to parse a specific part of a URL after a search, using any language (ideally JavaScript, but I'm open to Python).

How do I get a specific part of a URL and save/store it?

For example, on songkick.com, the way to get an artist_id is to check a specific part of the URL after searching for the artist name in the website's search bar.

In the case below, the artist id is 301329.

https://www.songkick.com/artists/301329-rac

I strongly believe there is a way to parse this part using either Python or JS, given that I have a CSV file with artist names in one of its columns. Instead of searching for all the artists one by one, I'm looking for an algorithm that iterates over my CSV column, searches for each name, parses the resulting URL, and saves/stores the id.

I would be very grateful for even a hint that I could start with.

Thank you so much always.

Upvotes: 0

Views: 127

Answers (2)

Perfect

Reputation: 1636

First, you can simply use a regular expression. In Python:

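import re

url = 'https://www.songkick.com/artists/301329-rac'
# Raw string so that \d and \w reach the regex engine unchanged.
pattern = r'/artists/(\d+)-\w'
match = re.search(pattern, url)
if match:
    # group(1) is the run of digits captured by (\d+), here '301329'.
    artist_id = match.group(1)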
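
To apply this to a whole CSV column, one possible approach is sketched below. It is only a sketch under assumptions: the column name `artist`, the helper names `extract_artist_id` and `collect_artist_ids`, and the `lookup_url` function are all hypothetical. In particular, `lookup_url` stands in for whatever you use to obtain the artist page URL for a name (for example, a request to the site's search page); that step is not shown here, and the demo uses a plain dictionary in its place.

```python
import csv
import io
import re

# Same idea as the pattern above: capture the digits between "/artists/" and the dash.
ARTIST_ID_PATTERN = re.compile(r'/artists/(\d+)-')


def extract_artist_id(url):
    """Return the artist id found in url as a string, or None if there is none."""
    match = ARTIST_ID_PATTERN.search(url)
    return match.group(1) if match else None


def collect_artist_ids(csv_file, lookup_url, column='artist'):
    """Map each artist name in the given CSV column to its id (None if not found).

    lookup_url is a hypothetical caller-supplied function: name -> URL or None.
    """
    ids = {}
    for row in csv.DictReader(csv_file):
        name = row[column]
        url = lookup_url(name)
        ids[name] = extract_artist_id(url) if url else None
    return ids


# Demo with an in-memory CSV and a stand-in lookup table instead of a real search.
sample_csv = io.StringIO('artist\nRAC\nUnknown Artist\n')
known_urls = {'RAC': 'https://www.songkick.com/artists/301329-rac'}
print(collect_artist_ids(sample_csv, known_urls.get))
```

With a real file you would pass `open('artists.csv', newline='')` instead of the `io.StringIO` object, and the resulting dictionary can then be written back out with `csv.writer`.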

I hope this will help you.

Upvotes: 0

Oussama Romdhane

Reputation: 166

It can be done using regular expressions.

Here's an example of a JavaScript implementation:

const url = "https://www.songkick.com/artists/301329-rac";

const regex = /https:\/\/www\.songkick\.com\/artists\/(\d+)-.+/;

const match = url.match(regex);

if (match) {
  console.log('Artist ID: ' + match[1]);
} else {
  console.log('No Artist ID found!');
}

This regular expression /https:\/\/www\.songkick\.com\/artists\/(\d+)-.+/ means that we're trying to match something that starts with https://www.songkick.com/artists/, followed by a group of one or more digits (captured by the parentheses), a dash, and then one or more characters of any kind.

The match() method retrieves the result of matching a string against a regular expression.

Thus it will return an array with the overall matched string at index 0 and the captured (\d+) group at index 1 (match[1] in our case). If there is no match, it returns null, which is why the if check is needed.

If you're not sure of the protocol (http vs https) you can add a ? in the regex right after https. That makes the s in https optional. So the regex would become /https?:\/\/www\.songkick\.com\/artists\/(\d+)-.+/.
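
Since the question is also open to Python, here is a rough equivalent of that optional-protocol regex as a sketch. The function name `artist_id` is just an illustrative choice, not part of any library.

```python
import re

# "s?" makes the "s" optional, so both http and https URLs are accepted.
pattern = re.compile(r'https?://www\.songkick\.com/artists/(\d+)-.+')


def artist_id(url):
    """Return the captured digits as a string, or None if the URL does not match."""
    match = pattern.match(url)
    return match.group(1) if match else None


print(artist_id('https://www.songkick.com/artists/301329-rac'))  # 301329
print(artist_id('http://www.songkick.com/artists/301329-rac'))   # 301329
print(artist_id('ftp://www.songkick.com/artists/301329-rac'))    # None
```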

Let me know if you need more explanation.

Upvotes: 1
