Reputation: 86
I have a string with several URLs inside it. I have managed to use regex to extract the first URL, but I really need them all. My script so far is below:
data = ['https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX342_.jpg":[355,342],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX425_.jpg":[441,425],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL.jpg":[500,482],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX466_.jpg":[483,466],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX385_.jpg":[399,385]}']
url = data[0][data[0].find("https://"):]
url[:url.find('"')]
Sorry - the script above didn't use regex; it was another way I tried to do this. My regex script is below, and it pretty much does the same thing. I don't really mind which approach we use; I just want to get all the URLs, since both of my scripts only extract the first one.
import re

url = re.search(r'(https)://.*?\.(jpg)', data[0])
if url:
    print(url.group(0))
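(For reference: `re.search` stops at the first match, which is why only one URL comes back. `re.findall` returns every non-overlapping match of the same pattern. A minimal sketch, using a shortened two-URL version of the string above:)

```python
import re

# Shortened sample in the same shape as the question's string (two URLs)
data = ('https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX342_.jpg":[355,342],'
        '"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX425_.jpg":[441,425]')

# re.findall returns a list of every non-overlapping match, not just the first
urls = re.findall(r'https://.*?\.jpg', data)
print(urls)
```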
I am scraping Amazon products - this is the context. I've also updated the string to one of the actual examples. Thanks everyone for the comments/help.
Upvotes: 0
Views: 60
Reputation: 296
Maybe this way:
URL_list = [i for i in data[0].split('"') if 'http' in i]
It doesn't use regex, but I don't see a need for regex in this case. (Note that `data` in your example is a one-element list, so you need to split `data[0]`, not `data` itself.)
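For example, applied to a shortened two-URL version of the string from the question (splitting on the `"` characters that delimit each URL, then keeping only the pieces that look like URLs):

```python
# Shortened sample in the same shape as the question's data (two URLs)
data = ['https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX342_.jpg":[355,342],'
        '"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX425_.jpg":[441,425]}']

# The double quotes separate URLs from their size pairs; filter keeps only URL pieces
URL_list = [i for i in data[0].split('"') if 'http' in i]
print(URL_list)
```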
Upvotes: 1
Reputation: 3372
Your new example string (from `data[0]`) is missing an opening curly brace and a double quote, but after adding those you can read it as JSON using the standard library. You might simply have copy/pasted it incorrectly.
In[2]: data = ['https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX342_.jpg":[355,342],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX425_.jpg":[441,425],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL.jpg":[500,482],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX466_.jpg":[483,466],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX385_.jpg":[399,385]}']
In[3]: import json
In[4]: d = json.loads('{"%s' % data[0])
In[5]: d
Out[5]:
{'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX342_.jpg': [355,
342],
'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX425_.jpg': [441,
425],
'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL.jpg': [500,
482],
'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX466_.jpg': [483,
466],
'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX385_.jpg': [399,
385]}
In[6]: list(d.keys())
Out[6]:
['https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX342_.jpg',
'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX425_.jpg',
'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL.jpg',
'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX466_.jpg',
'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX385_.jpg']
Upvotes: 1