Reputation: 99
I am teaching myself Python and am currently exploring web scraping. I have been working with Tumblr pictures, which behave oddly because each image has several links in the same string. I have managed to get one link per line, but I only want a single link.
I expected this to return something like:
source (blog_name)
link (the link which ends with 400w)
But no, I don't get any results from this.
If I remove the if statement, I get something like this:
source (blog_name)
link (75w)
link (100w)
link (250w)
link (400w)
link (400w)
Here is the code:
import requests
from bs4 import BeautifulSoup
posts_scrape = requests.get('tumblr.com/search/thingtosearch')
soup = BeautifulSoup(posts_scrape.text, 'html.parser')
articles = soup.find_all('article', class_='_2DpMA')
def getdata(url):
    r = requests.get(url)
    return r.text
for article in articles:
    try:
        post_notes = article.find('span', class_='_22VV4').text
        if 'notes' in post_notes:
            source = article.find('div', class_='_3QBiZ').text
            for imgvar in article.find_all('img', alt='Image'):
                url_results = imgvar['srcset']
                r_urls = url_results.replace(',','\n')
                for line in r_urls:
                    if line.find("400w"):
                        print(source)
                        print(r_urls)
                        break
    except AttributeError:
        continue
Upvotes: 0
Views: 84
Reputation: 573
Your code is a bit tangled and uses more for loops than it needs.
The reason it doesn't work as expected is that you use the if conditional to check line.find('400w'). str.find returns the index of the match, or -1 when there is no match. Every integer other than 0 is truthy, including -1, so the condition evaluates to True unless the match is at index 0.
Secondly, r_urls is a single string that contains all the URLs, so for line in r_urls walks over it one character at a time rather than one URL at a time. Calling line.find('400w') on a single character always returns -1, so the condition is always True, and print(r_urls) prints all the URLs together because they are in the same string.
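To make the point about find concrete, here is a small sketch using a made-up srcset string (the example.com URLs are placeholders, not real Tumblr data). It shows that -1 is truthy, that looping over a string yields characters, and that the in operator is the usual way to test for a substring.

```python
# Hypothetical srcset value, for illustration only.
srcset = "https://example.com/a.jpg 75w, https://example.com/b.jpg 400w"

# str.find returns an index, or -1 when the substring is absent.
not_found = "x".find("400w")
print(not_found)                     # -1

# -1 is truthy, 0 is falsy, so `if s.find(...)` is not a containment test.
print(bool(-1), bool(0), bool(7))    # True False True

# Iterating over a string yields single characters, not lines.
first_items = list("ab\ncd")[:3]
print(first_items)                   # ['a', 'b', '\n']

# Idiomatic alternatives: `in` for containment, splitlines() for lines.
has_400w = "400w" in srcset
lines = "ab\ncd".splitlines()
print(has_400w, lines)               # True ['ab', 'cd']
```

With these two changes alone (splitting the string before looping, and testing with in), the original loop would already match only the 400w entries.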
The following is a somewhat cleaner way to do what you want:
import requests
from bs4 import BeautifulSoup
search_term = 'dog'
# Use an f-string so the search term is actually inserted into the URL
posts_scrape = requests.get(f'https://www.tumblr.com/search/{search_term}')
soup = BeautifulSoup(posts_scrape.text, 'html.parser')
articles = soup.find_all('article', class_='_2DpMA')
for article in articles:
    try:
        source = article.find('div', class_='_3QBiZ').text
        urls = []
        for imgvar_avatar in article.find_all('img', alt='Image'):
            # Split srcset into its entries and keep only those that mention 400w
            url_list = [i for i in imgvar_avatar['srcset'].split(',') if '400w' in i]
            urls.extend(url_list)
        # One line per post: blog name followed by all matching links
        print(f'{source} : {urls}')
    except AttributeError:
        # article.find() returned None (no blog-name element); skip this post
        continue
This prints output in the following format:
blog name : [list of 400w links for all the 'Image' elements in the post]
Sample output for search_term = 'dog':
everythingfox : []
liriusworldfaws : []
cuteness--overload : [' https://64.media.tumblr.com/bdbfcdf3fc0462eb3be656f0c8085792/e47c10ace1710c69-dc/s400x600/41420a481f8150b866eab574e56cc43e6d8181ef.jpg 400w']
everythingfox : []
delta-breezes : [' https://64.media.tumblr.com/6ed0b95f72eb90a88dd15cb546d913c8/1a24f512409f7700-c6/s400x600/456a6e3ac2f073fba90dbe885c5868063f3a1f39.jpg 400w']
fluffygif : []
scampthecorgi : [' https://64.media.tumblr.com/05c8f0b3345906fc7e6c04282cee9382/4b68a8516b31bce5-6c/s400x600/bd689d26dcb0d39e1c9c2aec494247da425f5f25.jpg 400w']
sirartwork : [' https://64.media.tumblr.com/7329e508e44714f33f90bd69a26fb08e/d998f5e61b3dfe95-b5/s400x600/1c40fbb26161cfac31e9fe72c89bc4305dc9820e.jpg 400w']
k-ayo : [' https://64.media.tumblr.com/36580aa2e20ce45761d4d76f0c9a502d/044c64a380aaef8e-74/s400x600/bdaa0c14a96c8d4bf5cd61620e0d7384a1c13b05.jpg 400w', ' https://64.media.tumblr.com/a42440c97a294c4f53b8a0747d5a009e/044c64a380aaef8e-b3/s400x600/d49ff2b41c44846e002ed96d45e1adfcd59b2753.jpg 400w', ' https://64.media.tumblr.com/44f559aaf1babb250e650cfc3fa94070/044c64a380aaef8e-95/s400x600/abf3c7e3e6ba0de41258a1e7424d961dcb19616b.jpg 400w', ' https://64.media.tumblr.com/44a2c5351564ab917e114672daa737ab/044c64a380aaef8e-8e/s400x600/181e0316dce57706532cbf1becc3509725fc7683.jpg 400w']
pugsandfrenchbulldogs : [' https://64.media.tumblr.com/c09c92db68fd0beee2ede0cffff896c2/658837adc5e2db43-7f/s400x600/ffe111d3133966be7c05507a6f75cf08d10afc59.jpg 400w']
cuteness--overload : []
everythingfox : []
fascinic : [' https://64.media.tumblr.com/55cebf41bb3c979fdc68d4c09e32d96c/9a6d6375418d7a14-0b/s400x600/5df22c0d01f39932b883585fcafc80944dd489c5.jpg 400w']
cuteness--overload : [' https://64.media.tumblr.com/dc6b9d9955244eefc8e6de0690f970d3/e6517da006b766fc-6a/s400x600/9a0a24db65fb8265e2b42d049f58ca376997743d.jpg 400w']
cuteness--overload : [' https://64.media.tumblr.com/2fcf4bfdf94bc7725fc1064cc5fb37bb/4a5db62ba4bbd86e-12/s400x600/61a9826f9eff0c2de1984c7f5fe0ef535560eece.jpg 400w']
catasters : []
male--wife : [' https://64.media.tumblr.com/b849ab48e135972b5a0566491bdcea93/1f844f181a482794-a1/s400x600/cc6605131bd901a0246e61448f7a9caca8112be9.png 400w', ' https://64.media.tumblr.com/97ec5949947d8aed0cec995c0a74e3c3/1f844f181a482794-65/s400x600/cc58a0efcbc621f7033e71166d310779faa2b400.png 400w', ' https://64.media.tumblr.com/099b77b5f560330b48757febabdfb314/1f844f181a482794-45/s400x600/1370ae9dc74d2f6e30a376703a8a0d09277cc5a8.png 400w', ' https://64.media.tumblr.com/322b69a71f34171cc05906b446b2615a/1f844f181a482794-49/s400x600/549108985e37b0c823be0b7438e1c7de834a9b2e.png 400w', ' https://64.media.tumblr.com/bfb082cdd44c2545e58424ec704cdb2c/1f844f181a482794-1b/s400x600/d5f1ba9af3aa3e930e329ac230980b73f6882faa.png 400w', ' https://64.media.tumblr.com/a210dbea09d5bfbb60c0d01b3d07825d/1f844f181a482794-c1/s400x600/b9d380228956b7493415b08fac6f2f7e22a1e484.png 400w', ' https://64.media.tumblr.com/2d904ff086e5fee6282aaa02d99f7045/1f844f181a482794-78/s400x600/82d26a959fa1f79072673c5a3a7ff52df93d574d.png 400w', ' https://64.media.tumblr.com/c478911a2ea835549fc3c94018086233/1f844f181a482794-46/s400x600/f307f706a186febd4c02723de297dbcd3524f1a4.png 400w', ' https://64.media.tumblr.com/2e091b16060114ab6a501556f1992be2/1f844f181a482794-71/s400x600/45ace560d7ef390d041fa4883c64a656280ebd66.png 400w', ' https://64.media.tumblr.com/6bf67260c9286b800af99c7d15ccfa42/1f844f181a482794-0d/s400x600/b5496b038d4c27824df53df6fa00e183d6fa9bd3.png 400w']
liriusworldfaws : []
puppy-esso : [' https://64.media.tumblr.com/825ca92b45b1810f9182ca9631bf1560/7ebd600d44780133-1f/s400x600/14ea5af4c9e149cb88863dfc0117f10a62c82e99.jpg 400w']
hitmewithcute : [' https://64.media.tumblr.com/db5993448e5d8bb8228e0f1b81142e7a/dcefa14e006d548b-77/s400x600/f4d1dadd5b6a49417c8d15573972775e9e707a53.jpg 400w']
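Each entry in the output above still carries a leading space and the trailing 400w descriptor, and the question asked for a single link. A minimal sketch of one way to get a clean URL is shown below. The helper name pick_srcset_url and the example.com URLs are hypothetical, and the sketch assumes that the URLs themselves contain no commas, which holds for the sample output but is not guaranteed for srcset values in general.

```python
def pick_srcset_url(srcset, width="400w"):
    """Return the first URL whose width descriptor equals `width`, or None."""
    for candidate in srcset.split(","):
        # Each candidate looks like "<url> <descriptor>", e.g. "https://... 400w".
        parts = candidate.split()
        if len(parts) == 2 and parts[1] == width:
            return parts[0]
    return None


# Made-up srcset string with the same shape as the Tumblr ones.
example = (
    "https://example.com/s75x75/a.jpg 75w, "
    "https://example.com/s400x600/a.jpg 400w, "
    "https://example.com/s500x750/a.jpg 500w"
)
print(pick_srcset_url(example))  # https://example.com/s400x600/a.jpg
```

Because the function returns on the first match, a srcset that lists the 400w entry twice still yields only one link. In the loop above it could be called as pick_srcset_url(imgvar_avatar['srcset']), skipping images for which it returns None.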
Upvotes: 2