Problem with loop in python - Multiple entries of same data

Question

I am appending to a .csv file with Python. The data is scraped from the web, and almost everything related to the scraping is done.

The problem comes when I try to append to the file: it writes hundreds of entries of the same data. So I am sure there is a problem with the loop (the for or if statements) that I am not able to identify and solve.

The condition checks whether the data scraped from the web matches the data already in the file. If the data doesn't match, the program should write a new row; otherwise it should break or continue.

Note: csvFileArray is an array holding the data read from the existing file.txt. For example, print(csvFileArray[0]) gives:

{'Date': '19/05/21', 'Time': '14:51:00', 'Status': 'Waitlisted', 'School': 'MIT Sloan', 'Details': 'GPA: 3.4 Round: Round 2 | Texas'}

Below is the code that has a problem.

file = open('file.csv', 'a')
writer = csv.writer(file) 

#loop for page numbers
for page in range(15, 17):
    
    print("Getting page {}..".format(page))
    
    params["paged"] = page
    data = requests.post(url, data=params).json()
    soup = BeautifulSoup(data["markup"], "html.parser")
    
    for entry in soup.select(".livewire-entry"):
        
        datime = entry.select_one(".adate")
        status = entry.select_one(".status")
        name = status.find_next("strong")
        details = entry.select_one(".lw-details")

        datime = datime.get_text(strip=True)
        datime = datetime.datetime.strptime(datime, '%B %d, %Y %I:%M%p')
        time = datime.time() #returns time
        date = datime.date() #returns date
        
        for firstentry in csvFileArray:
            condition = (((firstentry['Date']) == date) and ((firstentry['Time']) == time)
                        and ((firstentry['Status']) == (status.get_text(strip=True))) and ((firstentry['School']) == (name.get_text(strip=True)))
                        and ((firstentry['Details']) == details.get_text(strip=True)))
    
            
            if condition:
                continue
                        
            else:
                writer.writerow([date, time, status.get_text(strip=True), name.get_text(strip=True),details.get_text(strip=True)])       
                #print('ok')
        
            
    print("-" * 80) 

file.close()

Tim Roberts · Accepted Answer

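I'm guessing you want to write the row only if NONE of the csvFileArray entries match it, i.e. the condition is false for ALL of them. Right now, you're writing it once for EVERY csvFileArray entry that doesn't match, so a file with hundreds of rows produces hundreds of copies of each scraped entry. Use a flag instead, and write once after the inner loop finishes: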

    for entry in soup.select(".livewire-entry"):
        
        datime = entry.select_one(".adate")
        status = entry.select_one(".status")
        name = status.find_next("strong")
        details = entry.select_one(".lw-details")

        datime = datime.get_text(strip=True)
        datime = datetime.datetime.strptime(datime, '%B %d, %Y %I:%M%p')
        time = datime.time() #returns time
        date = datime.date() #returns date

        should_write = True        
        for firstentry in csvFileArray:
            if (((firstentry['Date']) == date) and ((firstentry['Time']) == time)
                        and ((firstentry['Status']) == (status.get_text(strip=True))) and ((firstentry['School']) == (name.get_text(strip=True)))
                        and ((firstentry['Details']) == details.get_text(strip=True))):
                should_write = False
                break
                        
        if should_write:
            writer.writerow([date, time, status.get_text(strip=True), name.get_text(strip=True),details.get_text(strip=True)])       
            #print('ok')
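
To see why the original loop produces so many rows, here is a toy sketch with made-up data. The names `existing` and `scraped` are hypothetical stand-ins for csvFileArray and one scraped entry, and the counters replace the real writer.writerow calls:

```python
# Hypothetical stand-ins: three rows already in the file, one scraped entry
existing = [{"id": 1}, {"id": 2}, {"id": 3}]
scraped = {"id": 2}

# Original pattern: one write per existing row that does NOT match
writes_original = 0
for row in existing:
    if row == scraped:
        continue
    else:
        writes_original += 1  # stands in for writer.writerow(...)

# Flag pattern: decide once, after checking every existing row
should_write = True
for row in existing:
    if row == scraped:
        should_write = False
        break
writes_flag = 1 if should_write else 0

print(writes_original, writes_flag)
```

Even though the scraped entry is already present, the original pattern still writes it once for each of the other rows, while the flag pattern writes nothing.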

You could also use any() with a generator expression for this, but because your condition is large, that gets hard to read:

        if not any(
            (((firstentry['Date']) == date) and ((firstentry['Time']) == time)
                    and ((firstentry['Status']) == (status.get_text(strip=True))) and ((firstentry['School']) == (name.get_text(strip=True)))
                    and ((firstentry['Details']) == details.get_text(strip=True)))
            for firstentry in csvFileArray):
            writer.writerow([date, time, status.get_text(strip=True), name.get_text(strip=True),details.get_text(strip=True)])       
            #print('ok')
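
One more thing that may be worth checking: the sample row in the question stores 'Date' and 'Time' as strings, while the scraped `date` and `time` are datetime objects, and in Python a string never compares equal to a datetime.date or datetime.time. If csvFileArray really holds strings, the condition can never be true. Below is a rough sketch of one way around that, using a set of normalized tuples. The helper `row_key`, the format constants, and the `scraped` list are my own hypothetical names, and the '%d/%m/%y' format is only a guess based on the '19/05/21' sample:

```python
import datetime

DATE_FMT = "%d/%m/%y"   # assumption: guessed from the sample '19/05/21'
TIME_FMT = "%H:%M:%S"   # assumption: guessed from the sample '14:51:00'

def row_key(date, time, status, school, details):
    """Return a hashable, string-only key for one entry (hypothetical helper)."""
    if isinstance(date, datetime.date):
        date = date.strftime(DATE_FMT)
    if isinstance(time, datetime.time):
        time = time.strftime(TIME_FMT)
    return (date, time, status, school, details)

# Stand-in for the rows already read from the file
csvFileArray = [
    {"Date": "19/05/21", "Time": "14:51:00", "Status": "Waitlisted",
     "School": "MIT Sloan", "Details": "GPA: 3.4 Round: Round 2 | Texas"},
]
seen = {row_key(r["Date"], r["Time"], r["Status"], r["School"], r["Details"])
        for r in csvFileArray}

# Made-up scraped entries: one already in the file, one new, one repeat of the new one
scraped = [
    (datetime.date(2021, 5, 19), datetime.time(14, 51), "Waitlisted",
     "MIT Sloan", "GPA: 3.4 Round: Round 2 | Texas"),
    (datetime.date(2021, 5, 20), datetime.time(9, 30), "Accepted",
     "Example School", "GPA: 3.7"),
    (datetime.date(2021, 5, 20), datetime.time(9, 30), "Accepted",
     "Example School", "GPA: 3.7"),
]

new_rows = []
for item in scraped:
    key = row_key(*item)
    if key not in seen:
        seen.add(key)               # also blocks repeats within the same run
        new_rows.append(list(key))  # in the real script: writer.writerow(key)

print(new_rows)
```

The set lookup replaces the inner loop entirely, and adding each new key to `seen` keeps an entry that shows up twice in one scrape from being written twice.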
