Reputation: 189
I am appending to a .csv file with Python. The data is scraped from the web, and the scraping itself is almost done.
The problem appears when I append to the file: it writes the same data hundreds of times. I am sure the problem is in the loop (the for or if statements), but I am not able to identify and solve it.
The condition checks whether the data scraped from the web matches data already in the file. If the data doesn't match, the program should write a new row; otherwise it should skip that entry.
Note: csvFileArray is an array holding the data from the existing file.txt. For example, print(csvFileArray[0])
gives:
{'Date': '19/05/21', 'Time': '14:51:00', 'Status': 'Waitlisted', 'School': 'MIT Sloan', 'Details': 'GPA: 3.4 Round: Round 2 | Texas'}
Below is the code that has a problem.
file = open('file.csv', 'a')
writer = csv.writer(file)
#loop for page numbers
for page in range(15, 17):
    print("Getting page {}..".format(page))
    params["paged"] = page
    data = requests.post(url, data=params).json()
    soup = BeautifulSoup(data["markup"], "html.parser")
    for entry in soup.select(".livewire-entry"):
        datime = entry.select_one(".adate")
        status = entry.select_one(".status")
        name = status.find_next("strong")
        details = entry.select_one(".lw-details")
        datime = datime.get_text(strip=True)
        datime = datetime.datetime.strptime(datime, '%B %d, %Y %I:%M%p')
        time = datime.time() #returns time
        date = datime.date() #returns date
        for firstentry in csvFileArray:
            condition = (((firstentry['Date']) == date) and ((firstentry['Time']) == time)
                         and ((firstentry['Status']) == (status.get_text(strip=True))) and ((firstentry['School']) == (name.get_text(strip=True)))
                         and ((firstentry['Details']) == details.get_text(strip=True)))
            if condition:
                continue
            else:
                writer.writerow([date, time, status.get_text(strip=True), name.get_text(strip=True),details.get_text(strip=True)])
                #print('ok')
    print("-" * 80)
file.close()
Upvotes: 0
Views: 61
Reputation: 54698
I'm guessing you want to write the row only if the condition is false for ALL of the csvFileArray
entries, that is, only when no existing entry matches. Right now, you're writing it once for EVERY csvFileArray
entry that doesn't match.
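To make the difference concrete, here is a minimal sketch with made-up data. The names `existing` and `scraped` are hypothetical stand-ins for csvFileArray and one scraped entry; they are not from your code. In the original logic the `else` branch fires once for every existing row that differs from the scraped one, so a single entry gets written almost as many times as there are rows in the file.

```python
# Hypothetical stand-ins: three rows already in the file, one scraped entry.
existing = [{"School": "MIT Sloan"}, {"School": "Wharton"}, {"School": "Booth"}]
scraped = {"School": "Wharton"}

# Original logic: write inside the loop whenever a single row differs.
buggy_writes = []
for row in existing:
    if row == scraped:
        continue
    else:
        buggy_writes.append(scraped)

# Fixed logic: check every row first, then decide once after the loop.
fixed_writes = []
should_write = True
for row in existing:
    if row == scraped:
        should_write = False
        break
if should_write:
    fixed_writes.append(scraped)

print(len(buggy_writes))  # 2: one write per non-matching row
print(len(fixed_writes))  # 0: the entry is already in the file
```

Applied to the inner loop from the question, the same flag pattern looks like this: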
for entry in soup.select(".livewire-entry"):
    datime = entry.select_one(".adate")
    status = entry.select_one(".status")
    name = status.find_next("strong")
    details = entry.select_one(".lw-details")
    datime = datime.get_text(strip=True)
    datime = datetime.datetime.strptime(datime, '%B %d, %Y %I:%M%p')
    time = datime.time() #returns time
    date = datime.date() #returns date
    should_write = True
    for firstentry in csvFileArray:
        if (((firstentry['Date']) == date) and ((firstentry['Time']) == time)
                and ((firstentry['Status']) == (status.get_text(strip=True))) and ((firstentry['School']) == (name.get_text(strip=True)))
                and ((firstentry['Details']) == details.get_text(strip=True))):
            should_write = False
            break
    if should_write:
        writer.writerow([date, time, status.get_text(strip=True), name.get_text(strip=True),details.get_text(strip=True)])
        #print('ok')
You could also use any() with a generator expression for this, but because your condition is large, that gets hard to read:
if not any(
    (((firstentry['Date']) == date) and ((firstentry['Time']) == time)
     and ((firstentry['Status']) == (status.get_text(strip=True))) and ((firstentry['School']) == (name.get_text(strip=True)))
     and ((firstentry['Details']) == details.get_text(strip=True)))
    for firstentry in csvFileArray):
    writer.writerow([date, time, status.get_text(strip=True), name.get_text(strip=True),details.get_text(strip=True)])
    #print('ok')
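One more thing worth checking: the sample row in the question stores 'Date' and 'Time' as strings, while date and time in the loop are datetime.date and datetime.time objects. A string never compares equal to those objects, so the condition may never be true even with the loop fixed. Below is a rough sketch of one way around that, assuming the file stores dates as DD/MM/YY and times as HH:MM:SS (a guess from the sample row, so adjust it to the real file). The helpers `row_key` and `scraped_key`, the `seen` set, and the example timestamp are hypothetical names and values introduced for illustration. Building a set of keys once also avoids rescanning the whole list for every scraped entry.

```python
import datetime

def row_key(row):
    """Key for a row already in the CSV (all values are strings)."""
    return (row["Date"], row["Time"], row["Status"], row["School"], row["Details"])

def scraped_key(when, status, school, details):
    """Key for a scraped entry, formatted the same way as the CSV rows.

    Assumes DD/MM/YY dates and HH:MM:SS times; this is a guess from the sample row.
    """
    return (when.strftime("%d/%m/%y"), when.strftime("%H:%M:%S"), status, school, details)

# Sample row from the question, standing in for the real csvFileArray.
csvFileArray = [{"Date": "19/05/21", "Time": "14:51:00", "Status": "Waitlisted",
                 "School": "MIT Sloan", "Details": "GPA: 3.4 Round: Round 2 | Texas"}]
seen = {row_key(row) for row in csvFileArray}

# Example timestamp in the format the question's strptime call expects.
when = datetime.datetime.strptime("May 19, 2021 2:51pm", "%B %d, %Y %I:%M%p")
key = scraped_key(when, "Waitlisted", "MIT Sloan", "GPA: 3.4 Round: Round 2 | Texas")

if key not in seen:
    seen.add(key)  # also guards against duplicates within the same run
    print("new row:", key)
else:
    print("already in file")
```

In the real loop, the writer.writerow(...) call would go where the "new row" print is, and the row should be written in the same string format that the key uses so that later runs can match it.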
Upvotes: 1