Reputation: 47
I have a script that scrapes data from a website and stores it in a spreadsheet:
import csv

import requests
from bs4 import BeautifulSoup

with open(r"c:\source\list.csv") as f:
    for row in csv.reader(f):
        for url in row:
            r = requests.get(url)
            soup = BeautifulSoup(r.content, 'lxml')
            tables = soup.find('table', attrs={"class": "hpui-standardHrGrid-table"})
            for rows in tables.find_all('tr', {'releasetype': 'Current_Releases'}):
                # text of each cell in this table row
                item = [val.text.strip() for val in rows.find_all('td')]
                with open(r'c:\output_file.csv', 'a', newline='') as out:
                    writer = csv.writer(out)
                    writer.writerow([url])
                    writer.writerow(item)
Right now, when this script runs, about 50 new lines are added to the bottom of the CSV file (totally expected with append mode), but what I would like it to do is determine whether there are duplicate entries already in the CSV file and skip them, and then change the mismatches.
I feel like this should be possible, but I can't seem to think of a way to do it.
Any thoughts?
Upvotes: 0
Views: 1206
Reputation: 2240
You cannot do that without first reading the existing data from the CSV file. As for "changing the mismatches", you will simply have to overwrite them.
import csv

with open(r'c:\output_file.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for item in list_to_write_from:
        writer.writerow(item)
Here, you are assuming that list_to_write_from will contain the most current form of the data you need.
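For example, here is a minimal sketch of that read-then-overwrite idea. scraped_rows is a hypothetical stand-in for whatever rows your scraping loop collects; the path matches the one in your question.

import csv
import os

output_path = r'c:\output_file.csv'  # path from the question

# Collect the rows already present so they are not written twice
existing = set()
if os.path.isfile(output_path):
    with open(output_path, newline='') as f:
        existing = {tuple(row) for row in csv.reader(f)}

# scraped_rows stands in for the rows produced by your scraping loop
merged = existing | {tuple(row) for row in scraped_rows}

# Overwrite the file with the de-duplicated, most current data
with open(output_path, 'w', newline='') as f:
    csv.writer(f).writerows(merged)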
Upvotes: 1
Reputation: 47
I found a workaround to this problem, as the answer provided did not work for me.
I added:
import os

if os.path.isfile(r"c:\source\output_file.csv"):
    os.remove(r"c:\source\output_file.csv")
to the top of my code. This checks whether the file exists and deletes it, so that it gets recreated later with the most up-to-date information. It's a duct-tape way of doing things, but it works.
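An equivalent (again just a sketch) that avoids deleting the file is to truncate it once before the scraping loop, so the later appends start from an empty file:

# Truncating the output once has the same effect as delete-and-recreate:
# the file is left empty for the appends that follow in the scraping loop.
output_path = r"c:\source\output_file.csv"  # same path as in the check above
open(output_path, 'w', newline='').close()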
Upvotes: 0