Reputation: 11
I am using BeautifulSoup
to scrape data. There are multiple URLs, and I have to save the data I scrape from these URLs into the same CSV file. When I scrape the URLs separately and save to the same CSV file, only the data from the last URL I scraped ends up in the file. Below is the code I used to scrape the data.
images = []
pages = np.arange(1, 2, 1)
for page in pages:
    url = "https://www.bkmkitap.com/sanat"
    results = requests.get(url, headers=headers)
    soup = BeautifulSoup(results.content, "html.parser")
    book_div = soup.find_all("div", class_="col col-12 drop-down hover lightBg")
    sleep(randint(2, 10))
    for bookSection in book_div:
        img_url = bookSection.find("img", class_="lazy stImage").get('data-src')
        images.append(img_url)
books = pd.DataFrame(
    {
        "Image": images,
    }
)
books.to_csv("bkm_art.csv", index=False, header=True, encoding='utf-8-sig')
Upvotes: 0
Views: 960
Reputation: 25073
The main issue in your example is that you never request the second page, so you won't get its results - iterate over all of the pages first and then create your CSV.
The second issue, appending data to an existing file, is covered by @M B
Note: Try to avoid selecting your elements by classes, because they are more dynamic than ids
or the HTML structure
import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder - use the headers from your own script

data = []
for page in range(1, 3, 1):
    url = f"https://www.bkmkitap.com/sanat?pg={page}"
    results = requests.get(url, headers=headers)
    soup = BeautifulSoup(results.content, "html.parser")
    for bookSection in soup.select('[id*="product-detail"]'):
        data.append({
            'image': bookSection.find("img", class_="lazy stImage").get('data-src')
        })
books = pd.DataFrame(data)
books.to_csv("bkm_art.csv", index=False, header=True, encoding='utf-8-sig')
image
0 https://cdn.bkmkitap.com/sanat-dunyamiz-190-ey...
1 https://cdn.bkmkitap.com/sanat-dunyamiz-189-te...
2 https://cdn.bkmkitap.com/tiyatro-gazetesi-sayi...
3 https://cdn.bkmkitap.com/mavi-gok-kultur-sanat...
4 https://cdn.bkmkitap.com/sanat-dunyamiz-iki-ay...
... ...
112 https://cdn.bkmkitap.com/hayal-perdesi-iki-ayl...
113 https://cdn.bkmkitap.com/cins-aylik-kultur-der...
114 https://cdn.bkmkitap.com/masa-dergisi-sayi-48-...
115 https://cdn.bkmkitap.com/istanbul-sanat-dergis...
116 https://cdn.bkmkitap.com/masa-dergisi-sayi-49-...
117 rows × 1 columns
Upvotes: 0
Reputation: 166
import numpy as np

pages = np.arange(1, 2, 1)
for page in pages:
    print(page)
Try it; you will find that you just get 1.
Maybe you can use
pages = range(1, 2, 1)
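To make the point above concrete: the stop value is exclusive for both np.arange and the built-in range, so arange(1, 2, 1) produces only the single page number 1. To actually cover pages 1 and 2, the stop must be 3:

```python
import numpy as np

# The stop value is exclusive, so only [1] comes back here
print(list(np.arange(1, 2, 1)))  # [1]
print(list(range(1, 2, 1)))      # [1] - the built-in range behaves the same way

# To loop over pages 1 and 2, stop at 3
print(list(np.arange(1, 3, 1)))  # [1, 2]
```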
Upvotes: 0
Reputation: 3430
Your question isn't very clear. When you run this, I assume a CSV gets created with all the image URLs, and you want to rerun the same script and have additional image URLs appended to the same CSV? If that is the case, then you only need to change the to_csv
call to:
books.to_csv("bkm_art.csv", mode='a', index=False, header=False, encoding='utf-8-sig')
Adding mode='a'
makes pandas append to the file instead of overwriting it (doc).
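One wrinkle with mode='a': if the file does not exist yet, header=False leaves the CSV without a header row. A minimal sketch of one way around this, guarding the header with os.path.exists (the append_books helper and the sample URLs here are illustrative, not part of the original script):

```python
import os
import pandas as pd

def append_books(images, path="bkm_art.csv"):
    # Write the header only when the file does not exist yet,
    # so repeated runs append rows without duplicating the header.
    books = pd.DataFrame({"Image": images})
    books.to_csv(path, mode="a", index=False,
                 header=not os.path.exists(path), encoding="utf-8-sig")

# Two runs: the first creates the file with a header, the second appends a row
append_books(["https://cdn.bkmkitap.com/a.jpg"])
append_books(["https://cdn.bkmkitap.com/b.jpg"])
```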
Upvotes: 1