matthew matthee

Reputation: 86

Extract string with more than one URL

I have a string with several URLs inside it. I have managed to extract the first URL, but I really need them all. My script so far is below:

data = ['https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX342_.jpg":[355,342],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX425_.jpg":[441,425],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL.jpg":[500,482],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX466_.jpg":[483,466],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX385_.jpg":[399,385]}']
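text = data[0]  # data is a one-element list, so take the string out first
url = text[text.find("https://"):]
print(url[:url.find('"')])  # prints only the first URL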

Sorry, the script above doesn't use regex; it was another way I tried to do this. My regex script below does pretty much the same thing. I don't really mind which approach we use. I just want to get all the URLs, since both of my scripts only extract the first one.

import re

# re.search stops at the first match, so only one URL is returned
url = re.search(r'(https)://.*?\.(jpg)', data[0])
if url:
    print(url.group(0))

For context, I am scraping Amazon products. I've also updated the string to one of the actual examples. Thanks everyone for the comments and help.

Upvotes: 0

Views: 60

Answers (2)

DecaK

Reputation: 296

Maybe this way:

URL_list = [i for i in data[0].split('"') if 'http' in i]

It doesn't use regex, but I don't see a need for regex here.
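A minimal sketch of the split approach next to a regex alternative, in case a runnable comparison helps. It assumes data is the one-element list from the question (shortened here to two of its entries) and that every URL ends in .jpg; the variable names are illustrative only.

```python
import re

# Two entries from the question's string, kept in the same truncated shape.
data = ['https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX342_.jpg":[355,342],'
        '"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL.jpg":[500,482]}']

text = data[0]  # data is a list, so pull the string out before calling str methods

# Split on double quotes and keep only the pieces that start like a URL.
urls_by_split = [part for part in text.split('"') if part.startswith('http')]

# re.findall returns every non-overlapping match, unlike re.search.
# No capturing groups here, otherwise findall would return tuples of groups.
urls_by_regex = re.findall(r'https://[^"]*?\.jpg', text)

print(urls_by_split)
print(urls_by_regex)
```

Both lists should contain the same URLs for this input. The regex version is stricter, since it only accepts matches ending in .jpg.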

Upvotes: 1

G_M

Reputation: 3372

Your new example string (from data[0]) is missing an opening curly brace and a double quote, but after adding them you can read it as JSON using the standard library. You might have simply copied and pasted it incorrectly.

In[2]: data = ['https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX342_.jpg":[355,342],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX425_.jpg":[441,425],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL.jpg":[500,482],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX466_.jpg":[483,466],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX385_.jpg":[399,385]}']
In[3]: import json
In[4]: d = json.loads('{"%s' % data[0])
In[5]: d
Out[5]: 
{'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX342_.jpg': [355,
  342],
 'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX425_.jpg': [441,
  425],
 'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL.jpg': [500,
  482],
 'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX466_.jpg': [483,
  466],
 'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX385_.jpg': [399,
  385]}
In[6]: list(d.keys())
Out[6]: 
['https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX342_.jpg',
 'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX425_.jpg',
 'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL.jpg',
 'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX466_.jpg',
 'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX385_.jpg']
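
If the scraped string is sometimes complete and sometimes truncated like the one above, the repair step could be wrapped in a small helper. This is only a sketch under that assumption; extract_urls is a hypothetical name, not part of any library, and the sample strings are shortened from the question.

```python
import json

def extract_urls(fragment):
    """Return the URL keys of a JSON object string, restoring a missing '{"' prefix."""
    text = fragment.strip()
    if text.startswith('"'):
        text = '{' + text        # only the opening brace is missing
    elif not text.startswith('{'):
        text = '{"' + text       # both the brace and the quote are missing
    # Iterating a dict yields its keys in insertion order.
    return list(json.loads(text))

truncated = ('https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX342_.jpg":[355,342],'
             '"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL.jpg":[500,482]}')

print(extract_urls(truncated))
print(extract_urls('{"' + truncated))  # an already complete string is left as is
```

json.loads raises json.JSONDecodeError if the string is damaged in some other way, so a caller may want to catch that.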

Upvotes: 1
