C.K.

Reputation: 85

BeautifulSoup find_all returns duplicates

I'm trying to get metadata about journal articles; specifically, which section of the journal each article falls under. I'm using find_all to first get all the tags with the article titles, and then using that to parse the tags with the article section and url info.

When I was testing my code, I had it print all the titles, urls, and article types to the terminal so I could check if my script was grabbing the right data. The correct info was printing (that is, all the unique titles and urls and their article types), so I figured that I was on the right track.

Problem is that when I actually run the code I've pasted below, the output has the correct number of rows for the number of articles in the issue, but each row is a duplicate of the metadata for the last article in that issue, instead of showing the unique data for each article. For instance, if one issue has 42 articles, instead of 42 rows in the output each representing a different article in that issue, I only get the data for the last article in that issue duplicated 42 times in the output.

What am I neglecting to include in my code that would ensure that the output actually has all the unique data for each article in these issues?

import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd
import re
from lxml.html import fromstring
from itertools import cycle
import traceback

def get_proxies():
    url = 'https://free-proxy-list.net/'
    response = requests.get(url)
    parser = fromstring(response.text)
    proxies = set()
    for i in parser.xpath('//tbody/tr')[:10]:
        if i.xpath('.//td[7][contains(text(),"yes")]'):
            proxy = ":".join([i.xpath('.//td[1]/text()')[0], i.xpath('.//td[2]/text()')[0]])
            proxies.add(proxy)
    return proxies


json_data =[]
base_url = 'https://ajph.aphapublications.org'

#Get AJPH 2018 issues

ajph2018 = ['https://ajph.aphapublications.org/toc/ajph/108/1',
            'https://ajph.aphapublications.org/toc/ajph/108/2',
            'https://ajph.aphapublications.org/toc/ajph/108/3',
            'https://ajph.aphapublications.org/toc/ajph/108/4',
            'https://ajph.aphapublications.org/toc/ajph/108/5',
            'https://ajph.aphapublications.org/toc/ajph/108/6',
            'https://ajph.aphapublications.org/toc/ajph/108/7',
            'https://ajph.aphapublications.org/toc/ajph/108/8',
            'https://ajph.aphapublications.org/toc/ajph/108/9',
            'https://ajph.aphapublications.org/toc/ajph/108/10',
            'https://ajph.aphapublications.org/toc/ajph/108/11',
            'https://ajph.aphapublications.org/toc/ajph/108/12',
            'https://ajph.aphapublications.org/toc/ajph/108/S1',
            'https://ajph.aphapublications.org/toc/ajph/108/S2',
            'https://ajph.aphapublications.org/toc/ajph/108/S3',
            'https://ajph.aphapublications.org/toc/ajph/108/S4',
            'https://ajph.aphapublications.org/toc/ajph/108/S5',
            'https://ajph.aphapublications.org/toc/ajph/108/S6',
            'https://ajph.aphapublications.org/toc/ajph/108/S7']

for a in ajph2018:
    issue = requests.get(a)
    soup1 = BeautifulSoup(issue.text, 'lxml')

    #Get articles data
    ajph18_dict = {"url": "NaN", "articletype": "NaN", "title": "NaN"}
    all_titles = soup1.find_all("span", {"class": "hlFld-Title"})

    for each in all_titles:
        title = each.text.strip()
        articletype = each.find_previous("h2", {"class": "tocHeading"}).text.strip()
        doi_tag = each.find_previous("a", {"class": "ref nowrap", "href": True})
        doi = doi_tag["href"]
        url = base_url + doi

        if url is not None:
            ajph18_dict["url"] = url

        if title is not None:
            ajph18_dict["title"] = title

        if articletype is not None:
            ajph18_dict["articletype"] = articletype

        json_data.append(ajph18_dict)

df=pd.DataFrame(json_data)
df.to_csv("ajph_type.csv")

print("Saved")

Upvotes: 0

Views: 710

Answers (1)

Iain Shelvington

Reputation: 32244

You're appending the same dictionary (ajph18_dict) on every pass through your for loop. Any change made to that dictionary is reflected in every element of the list, so the last iteration overwrites all previous changes and you only see the values from the final loop.

You need to move the line ajph18_dict={"url":"NaN","articletype":"NaN", "title":"NaN"} to the top of the inner for loop, so that a new object is created on every iteration.

For example:

d = {}
l = []
for i in range(3):
    d['foo'] = i
    l.append(d)  # This just appends a reference to the same object every time

l is now a list of 3 elements that are all references to the same dictionary d. d now looks like {'foo': 2}, and l looks like [{'foo': 2}, {'foo': 2}, {'foo': 2}].
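One way to confirm this behavior (not part of the original answer, just a quick check) is to test object identity with the is operator after the loop finishes:

```python
d = {}
l = []
for i in range(3):
    d['foo'] = i
    l.append(d)  # appends a reference, not a copy

# Every element of l is literally the same object as d,
# so they all show the value from the final iteration.
print(all(item is d for item in l))  # True
print(l)  # [{'foo': 2}, {'foo': 2}, {'foo': 2}]
```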

l = []
for i in range(3):
    d = {}  # "d" is a new object every loop
    d['foo'] = i
    l.append(d)  # every element in "l" is a different object

l is now [{'foo': 0}, {'foo': 1}, {'foo': 2}] — three distinct dictionaries.
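Applied to the scraper in the question, the fix is to create the dictionary inside the inner loop. The sketch below uses placeholder article tuples in place of the live requests and BeautifulSoup parsing, just to show the loop structure:

```python
base_url = 'https://ajph.aphapublications.org'

# Placeholder data standing in for the parsed tags; in the real
# script these values come from find_all/find_previous on the soup.
articles = [
    ("Title A", "Research", "/doi/10.2105/a"),
    ("Title B", "Editorial", "/doi/10.2105/b"),
]

json_data = []
for title, articletype, doi in articles:
    # A fresh dict on every iteration, so each appended row
    # is an independent object with its own article's data.
    ajph18_dict = {"url": base_url + doi,
                   "articletype": articletype,
                   "title": title}
    json_data.append(ajph18_dict)

print(json_data)  # one unique row per article
```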

Upvotes: 1
