Reputation: 25

Delete all characters that come after a given string

How exactly can I delete the characters after .jpg? Is there a way in Python to separate the extension from whatever follows it? For example, I have a link like this:

https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC

How can I delete everything after .jpg? I tried replacing, but it didn't work. Is there another way, such as counting characters or something like that? I tried to get the jpg files with this:

import csv
import os

import requests
from bs4 import BeautifulSoup

for link in links:
    res = requests.get(link).text
    soup = BeautifulSoup(res, 'html.parser')
    img_links = []
    for img in soup.select('a.thumbnail img[src]'):
        print(img["src"])
        with open('links.csv', 'a', encoding='utf-8', newline='') as csv_file:
            # Check the same file that was just opened, so the header is written only once
            file_is_empty = os.stat('links.csv').st_size == 0
            fieldname = ['links']
            writer = csv.DictWriter(csv_file, fieldnames=fieldname)
            if file_is_empty:
                writer.writeheader()
            writer.writerow({'links': img["src"]})

        img_links.append(img["src"])

Upvotes: 1

Answers (6)

EXODIA

Reputation: 958

You can use a regular expression. You just want to ignore the characters after .jpg, so you can use something like this:

import re
new_url = re.findall(r"(.*\.jpg).*", old_url)[0]

(.*\.jpg) is a capturing group that matches any number of characters followed by .jpg. Since . has a special meaning in a regular expression, you need to escape the . in .jpg as \. The trailing .* matches any remaining characters, but because it sits outside the capturing group (), that part is matched without being extracted.
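
As a hedged sketch of the same idea: the helper below (the name strip_after_jpg and the example.com URL are made up for illustration) uses re.search with a non-greedy pattern, so it stops at the first .jpg, and it returns the input unchanged when there is no match instead of raising an IndexError.

```python
import re

def strip_after_jpg(url):
    """Return url cut right after the first '.jpg', or unchanged if absent."""
    # '.*?' is non-greedy, so the match ends at the first '.jpg' it finds
    match = re.search(r".*?\.jpg", url)
    return match.group(0) if match else url

old_url = "https://example.com/images/res_cd1f.jpg10DCC3DD"
new_url = strip_after_jpg(old_url)
print(new_url)
```

The greedy pattern in the answer above would instead keep everything up to the last .jpg, which only matters if the string contains .jpg more than once.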

Upvotes: 1

Alain T.

Reputation: 42143

You could use a regular expression to replace everything after .jpg with an empty string:

import re

url = 'https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'

name = re.sub(r'(?<=\.jpg).*', "", url)

print(name)
https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg
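
If the links may end in other image extensions too, the substitution can be generalized. This is only a sketch: the extension list and the name trim_after_extension are my own assumptions, not something from the question. It uses a capturing group rather than a lookbehind, because Python's re module requires lookbehind patterns to have a fixed width and .jpg and .jpeg differ in length.

```python
import re

# Assumed list of extensions; adjust it to whatever the site actually serves
IMAGE_EXT = re.compile(r"(\.(?:jpe?g|png|gif)).*", flags=re.IGNORECASE)

def trim_after_extension(url):
    """Drop everything that follows the first recognized image extension."""
    # \1 puts the captured extension back; the trailing '.*' is discarded
    return IMAGE_EXT.sub(r"\1", url)

print(trim_after_extension("https://example.com/img/a.JPG10DCC3"))
print(trim_after_extension("https://example.com/img/b.jpeg9F00"))
```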

Upvotes: 0

Florian Fasmeyer

Reputation: 879

See: Extracting extension from filename in Python

Instead of extracting the extension, we extract the filename and append the extension ourselves (this is fine as long as we know it is always .jpg).

import os
filename, file_extension = os.path.splitext('/path/to/somefile.jpg_corruptedpath')
result = filename + '.jpg'
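
To see what this does on the URL from the question, here is a small sketch. os.path.splitext splits at the last dot of the final path component, so the approach relies on the trailing junk containing no dot of its own; the second call (with a made-up filename) shows what happens otherwise.

```python
import os

url = ("https://s13emagst.akamaized.net/products/29146/29145166/images/"
       "res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC")

root, ext = os.path.splitext(url)
print(ext)            # the "extension" is .jpg plus the trailing junk
print(root + ".jpg")  # the cleaned URL

# Caveat: a dot inside the junk moves the split point
bad_root, bad_ext = os.path.splitext("somefile.jpg.v2")
print(bad_root + ".jpg")  # somefile.jpg.jpg
```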

Now, outside of the original question, I think there might be something wrong with how you got that piece of information in the first place. There must be a better way to extract that jpeg without messing around with the path. Sadly, I can't help you with that since I am a novice with BeautifulSoup.

Upvotes: 0

fynmnx

Reputation: 611

You could use split. This assumes the string contains '.jpg'; if it does not, split returns the whole string as its only element, so the code below returns the original URL with '.jpg' appended.

string = 'https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
jpg_removed = string.split('.jpg')[0]+'.jpg'

Example

string = 'www.google.com'
com_removed = string.split('.com')[0] 
# com_removed = 'www.google'
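
A closely related option is str.partition, which returns the text before the separator, the separator itself, and the rest. A minimal sketch (the helper name keep_through is hypothetical); when the marker is missing, the separator comes back empty, so the string is returned unchanged rather than with '.jpg' tacked on:

```python
def keep_through(text, marker=".jpg"):
    """Keep text up to and including the first marker, if present."""
    head, sep, _tail = text.partition(marker)
    # sep is '' when marker is not found, so head + sep is the original text
    return head + sep

print(keep_through("https://example.com/a.jpg10DCC3"))  # https://example.com/a.jpg
print(keep_through("www.google.com"))                   # www.google.com
```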

Upvotes: 3

Rima

Reputation: 1455

The find() method returns the lowest index of the substring if it is found in the given string. If it is not found, it returns -1.

url = 'https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
index = url.find('.jpg')
print(index)
# find() returns -1 when '.jpg' is missing; keep the URL unchanged in that case
new_url = url[:index] + '.jpg' if index != -1 else url

print(new_url)

Upvotes: 0

Dwight Foster

Reputation: 352

You can use the find method to locate .jpg, then slice the string to keep everything up to and including it. Example:

string = "https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC"
index = string.find(".jpg")
new_string = string[:index + 4]

You have to add four because that is the length of .jpg, so the slice does not cut the extension off too.
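
One thing to watch for: find returns -1 when the marker is absent, and slicing with -1 + 4 would then silently keep only the first three characters. A hedged sketch of a reusable helper (cut_after and the sample links are invented for illustration) that guards against this and uses len(marker) instead of a hard-coded 4:

```python
def cut_after(text, marker):
    """Return text truncated right after the first occurrence of marker."""
    index = text.find(marker)
    if index == -1:
        # Marker not present: leave the text as it is
        return text
    return text[:index + len(marker)]

# Hypothetical list standing in for the scraped img["src"] values
img_links = ["https://example.com/a.jpg10DCC3", "https://example.com/b.png"]
cleaned = [cut_after(link, ".jpg") for link in img_links]
print(cleaned)
```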

Upvotes: 0
