Reputation: 4038
I have a string:
test_string="lots of other html tags ,'https://news.sky.net/upload_files/image/2022/202209_166293.png',and still 'https://news.sky.net/upload_files/image/2022/202209_166293.jpg'"
How can I extract both complete URLs from the string using a Python regex?
I tried:
pattern = 'https://news.sky.net/upload_files/image'
result = re.findall(pattern, test_string)
I can get a list:
['https://news.sky.net/upload_files/image','https://news.sky.net/upload_files/image']
but not the whole URL, so I tried:
pattern = 'https://news.sky.net/upload_files/image...$png'
result = re.findall(pattern, test_string)
But received an empty list.
Upvotes: 2
Views: 2047
Reputation: 302
You can match any URL inside your string with the regex https?://[^\s']+
Applied to your code, it looks like this:
import re
string = "Some string here'https://news.sky.net/upload_files/image/2022/202209_166293.png' And here as well 'https://news.sky.net/upload_files/image/2022/202209_166293.jpg' that's it tho"
# No capturing groups, so findall returns each full match as a plain string
res = re.findall(r"https?://[^\s']+", string)
print(res)
This returns a list of the URLs found in the string:
[
'https://news.sky.net/upload_files/image/2022/202209_166293.png',
'https://news.sky.net/upload_files/image/2022/202209_166293.jpg'
]
How the pattern https?://[^\s']+ works:
https? - matches either http or https (the s is optional)
:// - matches the literal separator
[^\s']+ - matches one or more characters that are neither whitespace nor a single quote, so the match stops right before the closing quote
Note that a plain \S+ would also swallow the closing ' because a quote is not whitespace, and that wrapping parts of the pattern in capturing groups, as in (http(s)?://\S+), makes re.findall return a tuple per match instead of a string.
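To see how the choice of character class and the presence of capturing groups change what re.findall returns, here is a small sketch. The sample string and the variable names are made up for illustration and are not from the question.

```python
import re

# Hypothetical sample text: one URL wrapped in single quotes
sample = "see 'https://example.com/a.png' here"

# No capturing groups: findall returns whole matches, but \S+ also
# swallows the closing quote because ' is not whitespace
greedy = re.findall(r"https?://\S+", sample)

# Two capturing groups: findall returns one tuple per match
grouped = re.findall(r"(http(s)?://\S+)", sample)

# Excluding the quote from the character class stops at the end of the URL
clean = re.findall(r"https?://[^\s']+", sample)

print(greedy)
print(grouped)
print(clean)
```

Only the last variant gives a plain list of URLs without the trailing quote.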
Hope you find this helpful.
Upvotes: 2
Reputation: 147166
You could match a minimal number of characters after image up to a . and either png or jpg:
import re
test_string = "lots of other html tags ,'https://news.sky.net/upload_files/image/2022/202209_166293.png',and still 'https://news.sky.net/upload_files/image/2022/202209_166293.jpg'"
# Lazy .*? stops at the first .png or .jpg; dots in the host name are escaped so they match literally
pattern = r'https://news\.sky\.net/upload_files/image.*?\.(?:png|jpg)'
re.findall(pattern, test_string)
Output:
[
'https://news.sky.net/upload_files/image/2022/202209_166293.png',
'https://news.sky.net/upload_files/image/2022/202209_166293.jpg'
]
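The "minimal number of characters" part is what the lazy quantifier .*? provides. A rough sketch of the difference, using a made-up host and file names (example.com, 1.png, 2.jpg are assumptions, not from the question), and also showing why a pattern shaped like the one in the question comes back empty:

```python
import re

# Hypothetical text with two quoted image URLs on one line
text = "'https://example.com/image/1.png', 'https://example.com/image/2.jpg'"
base = r"https://example\.com/image"

# Lazy: stop at the first extension that follows each URL prefix
lazy = re.findall(base + r".*?\.(?:png|jpg)", text)

# Greedy: run to the last extension on the line, merging both URLs
greedy = re.findall(base + r".*\.(?:png|jpg)", text)

# Same shape as the question's attempt: ... is exactly three characters,
# and $ asserts end of string, so "png" can never follow it
failed = re.findall(base + r"...$png", text)

print(lazy)
print(greedy)
print(failed)
```

The lazy version yields two separate URLs, the greedy one a single merged match, and the last one an empty list.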
Upvotes: 2
Reputation: 521289
Assuming the URLs always appear inside single quotes, we can use re.findall as follows:
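import re
test_string = "lots of other html tags ,'https://news.sky.net/upload_files/image/2022/202209_166293.png',and still 'https://news.sky.net/upload_files/image/2022/202209_166293.jpg'"
# Only the captured group (the text between the quotes) is returned; lazy \S+? stops at the first closing quote
urls = re.findall(r"'(https?:\S+?)'", test_string)
print(urls)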
This prints:
['https://news.sky.net/upload_files/image/2022/202209_166293.png',
'https://news.sky.net/upload_files/image/2022/202209_166293.jpg']
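If the surrounding HTML might use double quotes as well, the same idea could be extended with a backreference. This is only a sketch under that assumption; the html string and the names below are hypothetical and not from the question.

```python
import re

# Hypothetical markup mixing double- and single-quoted URLs
html = """<img src="https://example.com/a.png"> and 'https://example.com/b.jpg'"""

# Group 1 remembers which quote opened the value; \1 requires the same
# quote to close it. Group 2 is the URL itself.
pattern = re.compile(r"""(['"])(https?:\S+?)\1""")

# With more than one group, findall would return tuples, so pick group 2
urls = [m.group(2) for m in pattern.finditer(html)]
print(urls)
```

Using finditer and selecting the group explicitly avoids the tuple results that re.findall gives when a pattern has several capturing groups.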
Upvotes: 2