Reputation: 25
How exactly can I delete the characters after .jpg? Is there a way to separate the extension I get with Python from what follows it? For example, I have a link like this:
https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC
How can I delete everything after .jpg? I tried replacing, but it didn't work. Is there another way, such as counting characters or something similar? I tried to get the .jpg files with this:
for link in links:
    res = requests.get(link).text
    soup = BeautifulSoup(res, 'html.parser')
    img_links = []
    for img in soup.select('a.thumbnail img[src]'):
        print(img["src"])
        with open('links'+'.csv', 'a', encoding='utf-8', newline='') as csv_file:
            file_is_empty = os.stat(self.filename+'.csv').st_size == 0
            fieldname = ['links']
            writer = csv.DictWriter(csv_file, fieldnames=fieldname)
            if file_is_empty:
                writer.writeheader()
            writer.writerow({'links': img["src"]})
        img_links.append(img["src"])
Upvotes: 1
Views: 1533
Reputation: 958
You can use a regular expression. You just want to ignore the characters after .jpg, so you can use something like this:
import re
new_url = re.findall(r"(.*\.jpg).*", old_url)[0]
(.*\.jpg) is a capturing group that matches any number of characters up to and including .jpg. Since . has a special meaning in a regular expression, you need to escape the . in .jpg with a \. The trailing .* matches any number of characters, but since it is outside the capturing group (), it gets matched but not extracted.
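If some links might not contain .jpg at all, indexing the findall result with [0] raises an IndexError. A minimal sketch of a more defensive variant, assuming you would rather keep such links unchanged (the helper name trim_after_jpg and the example.com URLs are made up for illustration):

```python
import re

def trim_after_jpg(url):
    # Hypothetical helper: keep everything up to and including the last ".jpg".
    # The greedy .* means the last ".jpg" in the string is the one that counts.
    match = re.search(r".*\.jpg", url)
    # Fall back to the unchanged URL when there is no ".jpg" to anchor on.
    return match.group(0) if match else url

print(trim_after_jpg("https://example.com/images/photo.jpg10DCC3DD"))
print(trim_after_jpg("https://example.com/images/photo.png"))
```

The first call prints the URL cut at .jpg, the second prints the .png URL untouched.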
Upvotes: 1
Reputation: 42143
You could use a regular expression to replace everything after .jpg with an empty string:
import re
url ='https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
name = re.sub(r'(?<=\.jpg).*',"",url)
print(name)
https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg
Upvotes: 0
Reputation: 879
See: Extracting extension from filename in Python
Instead of extracting the extension, we extract the filename and add the extension (if we know it's always .jpg, it's fine!)
import os
filename, file_extension = os.path.splitext('/path/to/somefile.jpg_corruptedpath')
result = filename + '.jpg'
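Applied to the URL from the question, the same idea might look like the sketch below. It assumes the junk after .jpg contains no further dots or slashes, and it uses posixpath.splitext (my choice, not something the question requires) so that the forward slashes in the URL are treated the same way on every OS:

```python
import posixpath

url = 'https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'

# splitext splits at the last dot of the last path component,
# so ext holds '.jpg' plus the unwanted trailing characters.
root, ext = posixpath.splitext(url)

# Re-attach the extension we actually want.
result = root + '.jpg'
print(result)
```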
Now, outside of the original question, I think there might be something wrong with how you got that piece of information in the first place. There must be a better way to extract that JPEG without messing around with the path. Sadly, I can't help you with that since I am a novice with BeautifulSoup.
Upvotes: 0
Reputation: 611
You could use split (assuming the string contains '.jpg'; otherwise the code below will return the original URL with '.jpg' appended).
string = 'https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
jpg_removed = string.split('.jpg')[0]+'.jpg'
Example
string = 'www.google.com'
com_removed = string.split('.com')[0]
# com_removed = 'www.google'
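If you want the string left untouched when the marker is missing, str.partition is one option, since it returns an empty separator in that case. A rough sketch (the function name cut_after and the short sample strings are just for illustration):

```python
def cut_after(text, marker='.jpg'):
    # partition splits at the FIRST occurrence of marker.
    # When marker is absent, head is the whole text and sep is ''.
    head, sep, _tail = text.partition(marker)
    return head + sep

print(cut_after('images/res_cd1f.jpg10DCC3DD'))
print(cut_after('www.google.com'))
```

Note that this cuts at the first .jpg in the string, which only matters if .jpg can appear more than once.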
Upvotes: 3
Reputation: 1455
The find() method returns the lowest index of the substring if it is found in the given string. If it is not found, it returns -1.
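# Named url rather than str, so the built-in str type is not shadowed
url = 'https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
result = url.find('jpg')  # index of the first 'jpg', or -1 if it is absent
print(result)
new_str = url[:result]  # everything before 'jpg'
print(new_str + 'jpg')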
Upvotes: 0
Reputation: 352
You can use the .find method to find the position of .jpg, then slice the string to keep everything up to and including it. Ex:
string = 'https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
index = string.find(".jpg")
new_string = string[:index + 4]
You have to add four because that is the length of .jpg, so the slice does not cut that off too. If .jpg is not in the string, find returns -1, so check for that before slicing.
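A small sketch of the same approach with the not-found case handled and the length taken from the marker instead of a hard-coded 4 (the name keep_through and the sample strings are only illustrative):

```python
def keep_through(text, marker=".jpg"):
    index = text.find(marker)
    if index == -1:
        # Marker not present: return the text unchanged instead of
        # slicing with a meaningless index.
        return text
    # Keep everything up to and including the marker.
    return text[:index + len(marker)]

print(keep_through("images/res_cd1f.jpg10DCC3DD"))
print(keep_through("images/res_cd1f.png"))
```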
Upvotes: 0