Reputation: 85
I'm trying to get metadata about journal articles; specifically, which section of the journal each article falls under. I'm using find_all to first get all the tags with the article titles, and then using that to parse the tags with the article section and url info.
When I was testing my code, I had it print all the titles, urls, and article types to the terminal so I could check if my script was grabbing the right data. The correct info was printing (that is, all the unique titles and urls and their article types), so I figured that I was on the right track.
The problem is that when I actually run the code I've pasted below, the output has the correct number of rows for the number of articles in the issue, but every row is a duplicate of the metadata for the last article in that issue rather than the unique data for each article. For instance, if an issue has 42 articles, instead of 42 rows each representing a different article, I get the data for the last article repeated 42 times.
What am I neglecting to include in my code that would ensure that the output actually has all the unique data for each article in these issues?
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd
import re
from lxml.html import fromstring
from itertools import cycle
import traceback

def get_proxies():
    url = 'https://free-proxy-list.net/'
    response = requests.get(url)
    parser = fromstring(response.text)
    proxies = set()
    for i in parser.xpath('//tbody/tr')[:10]:
        if i.xpath('.//td[7][contains(text(),"yes")]'):
            proxy = ":".join([i.xpath('.//td[1]/text()')[0], i.xpath('.//td[2]/text()')[0]])
            proxies.add(proxy)
    return proxies
json_data = []
base_url = 'https://ajph.aphapublications.org'
#Get Health Affairs 2018 issues
ajph2018 = ['https://ajph.aphapublications.org/toc/ajph/108/1',
'https://ajph.aphapublications.org/toc/ajph/108/2',
'https://ajph.aphapublications.org/toc/ajph/108/3',
'https://ajph.aphapublications.org/toc/ajph/108/4',
'https://ajph.aphapublications.org/toc/ajph/108/5',
'https://ajph.aphapublications.org/toc/ajph/108/6',
'https://ajph.aphapublications.org/toc/ajph/108/7',
'https://ajph.aphapublications.org/toc/ajph/108/8',
'https://ajph.aphapublications.org/toc/ajph/108/9',
'https://ajph.aphapublications.org/toc/ajph/108/10',
'https://ajph.aphapublications.org/toc/ajph/108/11',
'https://ajph.aphapublications.org/toc/ajph/108/12',
'https://ajph.aphapublications.org/toc/ajph/108/S1',
'https://ajph.aphapublications.org/toc/ajph/108/S2',
'https://ajph.aphapublications.org/toc/ajph/108/S3',
'https://ajph.aphapublications.org/toc/ajph/108/S4',
'https://ajph.aphapublications.org/toc/ajph/108/S5',
'https://ajph.aphapublications.org/toc/ajph/108/S6',
'https://ajph.aphapublications.org/toc/ajph/108/S7']
for a in ajph2018:
    issue = requests.get(a)
    soup1 = BeautifulSoup(issue.text, 'lxml')
    # Get articles data
    ajph18_dict = {"url": "NaN", "articletype": "NaN", "title": "NaN"}
    all_titles = soup1.find_all("span", {"class": "hlFld-Title"})
    for each in all_titles:
        title = each.text.strip()
        articletype = each.find_previous("h2", {"class": "tocHeading"}).text.strip()
        doi_tag = each.find_previous("a", {"class": "ref nowrap", "href": True})
        doi = doi_tag["href"]
        url = base_url + doi
        if url is not None:
            ajph18_dict["url"] = url
        if title is not None:
            ajph18_dict["title"] = title
        if articletype is not None:
            ajph18_dict["articletype"] = articletype
        json_data.append(ajph18_dict)
df = pd.DataFrame(json_data)
df.to_csv("ajph_type.csv")
print("Saved")
Upvotes: 0
Views: 710
Reputation: 32244
You're appending the same dictionary (ajph18_dict) every time in your for loop, so any changes made to that dictionary are reflected in every element of the list. Each iteration overwrites the previous values, which is why the output only contains the data from the last article.
You need to move the line ajph18_dict = {"url": "NaN", "articletype": "NaN", "title": "NaN"} to the top of the inner for loop so that a new dictionary object is created on every iteration.
For example:
d = {}
l = []
for i in range(3):
    d['foo'] = i
    l.append(d)  # This just appends a reference to the same object every time
l is now a list of 3 elements that are all references to the same dictionary d. d now looks like {'foo': 2}, and l now looks like [{'foo': 2}, {'foo': 2}, {'foo': 2}].
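A quick way to confirm that the list holds references rather than copies is an identity check; mutating the dict through one element is visible through all of them:

```python
d = {}
l = []
for i in range(3):
    d['foo'] = i
    l.append(d)

# All three list elements point at the very same dict object
assert l[0] is l[1] is l[2]

# A change made through one reference shows up in every element
l[0]['bar'] = 'x'
print(l[2]['bar'])  # x
```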
l = []
for i in range(3):
    d = {}  # "d" is a new object every loop
    d['foo'] = i
    l.append(d)  # every element in "l" is a different object
l now looks like [{'foo': 0}, {'foo': 1}, {'foo': 2}].
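Applied to your scraper, the same pattern means building a fresh dict for each article inside the inner loop. Here's a sketch using stand-in article data (the titles, DOI paths, and types below are made up, so there are no network calls):

```python
# Stand-in for the (title, doi, articletype) values your inner loop extracts
articles = [("Title A", "/doi/10.2105/a", "Research"),
            ("Title B", "/doi/10.2105/b", "Editorial")]
base_url = 'https://ajph.aphapublications.org'

json_data = []
for title, doi, articletype in articles:
    # New dict object on every pass, so each appended row is independent
    row = {"url": base_url + doi, "title": title, "articletype": articletype}
    json_data.append(row)

print([r["title"] for r in json_data])  # ['Title A', 'Title B']
```

Alternatively, keeping your original structure and appending ajph18_dict.copy() would also snapshot the values each pass, though creating a new dict per iteration is the more idiomatic fix.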
Upvotes: 1