Reputation: 353
I apologize in advance for the long post, but I've made sure it's easy to follow and very clear.
My question is this:
How can I create a nested dictionary out of lists, with specified duplicate keys?
Here's an example of what I'd like to make, using data for a fictional news article:
{'http://www.SomeNewsWebsite.com/Article12345':
    {'Title': 'Trump Does Another Ridiculous Thing',
     'Source': 'Some News Website',
     'Thumbnail': 'SomeNewsWebsite.com/image12345'}}
I have read similar posts where people do this kind of thing, but I have struggled to port those ideas into my own work.
That's the end of my question. Below, I've posted my code and the example lists it generates, which is what I'd be using to make this nested dictionary. It's also available on my GitHub.
So far, I can use the following code to fetch the data, cut out the important bits, and make two lists: one for URLs, one for titles. It then uses zip() to combine them into a tidy dictionary.
import re

import requests
from bs4 import BeautifulSoup

url = "http://www.reuters.com"
source = "Reuters"
thumbnail = "http://logok.org/wp-content/uploads/2014/04/Reuters-logo.png"

def soup():
    """ Fetches HTML from site and turns it into a bs4 object. """
    get_html = requests.get(url)
    html = get_html.text
    make_soup = BeautifulSoup(html, 'html.parser')
    return make_soup

# Tell bs4 where to find the important information (headlines, URLs)
important_data = (soup().select(".story-content > .story-title > a"))

# Turn that important data into a string so it may be parsed using RegEx
stringed_data = ' || '.join(str(v) for v in important_data)

def get_headline():
    """ Uses Regular Expressions to find headlines. Returns a list. """
    headline = re.findall(r'(?<=">)(.*?)(?=</a>)', stringed_data)
    return headline

def get_link():
    """ Uses Regular Expressions to find links. Returns a list. """
    link = re.findall(r'(?<=<a href=")(.*?)(?=")', stringed_data)
    return link

def build_dict():
    """ Combine everything into a tidy dictionary. """
    full_urls = [i if i.startswith('http') else url + i for i in get_link()]
    reuters_dictionary = dict(zip(get_headline(), full_urls))
    return full_urls

get_link()
get_headline()
soup()
build_dict()
When run, this code will create 2 lists, then a dictionary. Example data is shown below:
List of titles: (29 items long)
['Trump strikes defiant tone ahead of debate', 'Matthew swamps North Carolina, still dangerous as it heads out to sea', "Tesla's Musk says will not have to raise funds in fourth-quarter", 'Suspect arrested in fatal shooting of two California police officers', 'Russia says U.S. actions threaten its national security', 'Western-backed coalition under pressure over Yemen raid', "Fed's Fischer says job gains solid, expects growth to pick up", "Thai king's condition unstable after hemodialysis treatment: palace", 'Pope names new group of cardinals, adding to potential successors', 'Palestinian kills two people in Jerusalem, then shot dead: police', "Commentary: House of Lies — the uncanny allure of 'Girl on the Train'", 'Earnings season begins as White House race heats up', 'Russia expects OPEC to ask non members to consider joining output curb', 'Banks ponder the meaning of life as Deutsche agonizes', 'IMF says still engaged with Greece, no decision yet on bailout role', 'Pound slump exacerbates Brexit impact for German exporters: DIHK', 'Iranian, Iraqi oil ministers will not attend Istanbul talks: sources', 'Ukraine military postpones withdrawal from town, cites rebel shelling', 'German police make new raid in hunt for refugee planning bomb attack', "South African President Zuma's rape accuser dies: family", 'Xi says China must speed up plans for domestic network technology', 'UberEats to expand to Berlin in 2017: Tagesspiegel', 'Beijing, Shanghai propose curbs on who can drive for ride-hailing services', 'Pressure on Trump likely to be intense at second debate with Clinton', "Sanders supporters seethe over Clinton's leaked remarks to Wall St.", 'Evangelical leaders stick with Trump, focus on defeating Clinton', 'Citi sells its Argentinian consumer business to Banco Santander', "Itaú to pay $220 million for Citigroup's Brazil assets", 'LafargeHolcim agrees sale of Chilean business Cemento Polpaico']
List of URLs: (29 items long)
['/article/us-usa-election-idUSKCN1290JZ', '/article/us-storm-matthew-idUSKCN129063', '/article/us-tesla-equity-solarcity-idUSKCN1290QW', '/article/us-california-police-shooting-idUSKCN1280YH', '/article/us-russia-usa-idUSKCN1290DP', '/article/us-yemen-security-coalition-pressure-idUSKCN1290JM', '/article/us-usa-fed-fischer-idUSKCN1290JB', '/article/us-thailand-king-idUSKCN1290R8', '/article/us-pope-cardinals-idUSKCN1290C9', '/article/us-israel-palestinians-violence-idUSKCN129070', '/article/us-society-entertainment-film-idUSKCN127229', '/article/us-usa-stocks-weekahead-idUSKCN1272HS', '/article/us-oil-opec-russia-idUSKCN1290KD', '/article/us-imf-g20-banks-idUSKCN1290DX', '/article/us-imf-g20-greece-idUSKCN1290R6', '/article/us-britain-eu-germany-idUSKCN1290TZ', '/article/us-oil-opec-istanbul-idUSKCN1290N2', '/article/us-ukraine-crisis-withdrawal-idUSKCN1290UL', '/article/us-germany-bomb-idUSKCN1290D2', '/article/us-safrica-zuma-idUSKCN1290SX', '/article/us-china-internet-security-idUSKCN1290LA', '/article/us-uber-germany-eats-idUSKCN1290OB', '/article/us-china-regulations-ride-hailing-idUSKCN1280EL', '/article/us-usa-election-debate-idUSKCN1290AS', '/article/us-usa-election-clinton-idUSKCN1280Z9', '/article/us-usa-election-trump-evangelicals-idUSKCN1280WE', '/article/us-citi-argentina-m-a-banco-santander-ri-idUSKCN1290SD', '/article/us-citibank-brasil-m-a-itau-unibco-hldg-idUSKCN1280HM', '/article/us-lafargeholcim-divestment-chile-idUSKCN1280BU']
Dictionary of titles and URLs: (29 items long)
{'Banks ponder the meaning of life as Deutsche agonizes': 'http://www.reuters.com/article/us-imf-g20-banks-idUSKCN1290DX',
 'German police make new raid in hunt for refugee planning bomb attack': 'http://www.reuters.com/article/us-germany-bomb-idUSKCN1290D2',
 'Suspect arrested in fatal shooting of two California police officers': 'http://www.reuters.com/article/us-california-police-shooting-idUSKCN1280YH',
 'Evangelical leaders stick with Trump, focus on defeating Clinton': 'http://www.reuters.com/article/us-usa-election-trump-evangelicals-idUSKCN1280WE',
 'Xi says China must speed up plans for domestic network technology': 'http://www.reuters.com/article/us-china-internet-security-idUSKCN1290LA',
 "Australia's Rinehart and China's Shanghai CRED agree on deal for Kidman cattle empire": 'http://www.reuters.com/article/us-australia-china-landsale-dakang-p-f-idUSKCN12908O',
 'LafargeHolcim agrees sale of Chilean business Cemento Polpaico': 'http://www.reuters.com/article/us-lafargeholcim-divestment-chile-idUSKCN1280BU',
 'Citi sells Argentinian consumer unit a day after Brazil sale': 'http://www.reuters.com/article/us-citi-argentina-m-a-banco-santander-ri-idUSKCN1290SD',
 'Beijing, Shanghai propose curbs on who can drive for ride-hailing services': 'http://www.reuters.com/article/us-china-regulations-ride-hailing-idUSKCN1280EL',
 'Pope names new group of cardinals, adding to potential successors': 'http://www.reuters.com/article/us-pope-cardinals-idUSKCN1290C9',
 "Commentary: House of Lies — the uncanny allure of 'Girl on the Train'": 'http://www.reuters.com/article/us-society-entertainment-film-idUSKCN127229',
 'Iranian, Iraqi oil ministers will not attend Istanbul talks: sources': 'http://www.reuters.com/article/us-oil-opec-istanbul-idUSKCN1290N2',
 "South African President Zuma's rape accuser dies: family": 'http://www.reuters.com/article/us-safrica-zuma-idUSKCN1290SX',
 'Palestinian kills two people in Jerusalem, then shot dead: police': 'http://www.reuters.com/article/us-israel-palestinians-violence-idUSKCN129070',
 'Matthew swamps North Carolina, still dangerous as it heads out to sea': 'http://www.reuters.com/article/us-storm-matthew-idUSKCN129063',
 'Western-backed coalition under pressure over Yemen raid': 'http://www.reuters.com/article/us-yemen-security-coalition-pressure-idUSKCN1290JM',
 'Trump strikes defiant tone ahead of debate': 'http://www.reuters.com/article/us-usa-election-idUSKCN1290JZ',
 'Russia says U.S. actions threaten its national security': 'http://www.reuters.com/article/us-russia-usa-idUSKCN1290DP',
 'Pressure on Trump likely to be intense at second debate with Clinton': 'http://www.reuters.com/article/us-usa-election-debate-idUSKCN1290AS',
 "Sanders supporters seethe over Clinton's leaked remarks to Wall St.": 'http://www.reuters.com/article/us-usa-election-clinton-idUSKCN1280Z9',
 "Tesla's Musk says will not have to raise funds in fourth-quarter": 'http://www.reuters.com/article/us-tesla-equity-solarcity-idUSKCN1290QW',
 "Fed's Fischer says job gains solid, expects growth to pick up": 'http://www.reuters.com/article/us-usa-fed-fischer-idUSKCN1290JB',
 'Ukraine military postpones withdrawal from town, cites rebel shelling': 'http://www.reuters.com/article/us-ukraine-crisis-withdrawal-idUSKCN1290UL',
 "Thai king's condition unstable after hemodialysis treatment: palace": 'http://www.reuters.com/article/us-thailand-king-idUSKCN1290R8',
 'Earnings season begins as White House race heats up': 'http://www.reuters.com/article/us-usa-stocks-weekahead-idUSKCN1272HS',
 'IMF says still engaged with Greece, no decision yet on bailout role': 'http://www.reuters.com/article/us-imf-g20-greece-idUSKCN1290R6',
 'Pound slump exacerbates Brexit impact for German exporters: DIHK': 'http://www.reuters.com/article/us-britain-eu-germany-idUSKCN1290TZ',
 'Russia expects OPEC to ask non members to consider joining output curb': 'http://www.reuters.com/article/us-oil-opec-russia-idUSKCN1290KD',
 'UberEats to expand to Berlin in 2017: Tagesspiegel': 'http://www.reuters.com/article/us-uber-germany-eats-idUSKCN1290OB'}
For clarity, I'd like to use this data to create a dictionary for each pairing of title and URL, like the following:
{'http://www.reuters.com/article/us-imf-g20-banks-idUSKCN1290DX':
    {'Title': 'Banks ponder the meaning of life as Deutsche agonizes',
     'Source': 'Reuters',
     'Thumbnail': 'http://logok.org/wp-content/uploads/2014/04/Reuters-logo.png'}}
Thanks a ton for taking the time to read, and thank you in advance for your help.
Upvotes: 0
Views: 997
Reputation: 1921
This should give you the result you want:
def build_dict():
    """ Combine everything into a tidy dictionary. """
    full_urls = [i if i.startswith('http') else url + i for i in get_link()]
    reuters_dictionary = {}
    # The loop variable is named full_url rather than url: assigning to url
    # inside this function would make it local and break "url + i" above.
    for headline, full_url in zip(get_headline(), full_urls):
        reuters_dictionary[full_url] = {
            'Title': headline,
            'Source': source,
            'Thumbnail': thumbnail
        }
    return reuters_dictionary  # the question's code returned full_urls here, which looks unintended
However, there is nothing about duplicate keys here. Why do you feel a need for duplicate keys?
Also you should probably refactor to remove those global variables.
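One way to drop the globals, as a rough sketch: pass the scraped lists and the site details in as arguments. The parameter names and the use of urllib.parse.urljoin are my own choices here, not part of the question's code.

```python
from urllib.parse import urljoin


def build_dict(headlines, links, base_url, source, thumbnail):
    """Map each absolute article URL to its title, source and thumbnail."""
    articles = {}
    for headline, link in zip(headlines, links):
        # urljoin leaves absolute links alone and resolves relative ones
        full_url = urljoin(base_url, link)
        articles[full_url] = {
            'Title': headline,
            'Source': source,
            'Thumbnail': thumbnail,
        }
    return articles


example = build_dict(
    ['Trump strikes defiant tone ahead of debate'],
    ['/article/us-usa-election-idUSKCN1290JZ'],
    'http://www.reuters.com',
    'Reuters',
    'http://logok.org/wp-content/uploads/2014/04/Reuters-logo.png',
)
```

With everything passed in, the same function could be reused for another site by changing only the arguments.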
Finally, if you are already using BeautifulSoup, why fall back to regular expressions afterwards? I think using BeautifulSoup throughout would be more robust.
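As a rough sketch of the BeautifulSoup-only approach: each element returned by select() is a Tag, so the link and the headline can be read from it directly. The HTML snippet below is made up to mimic the structure implied by the question's selector; the real page may well differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped to match the selector used in the question
sample_html = """
<div class="story-content">
  <h3 class="story-title">
    <a href="/article/us-usa-election-idUSKCN1290JZ">Trump strikes defiant tone ahead of debate</a>
  </h3>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
anchors = soup.select(".story-content > .story-title > a")

# Read the attribute and the text from each Tag, no regex needed
links = [a['href'] for a in anchors]
headlines = [a.get_text(strip=True) for a in anchors]
```

This would replace stringed_data, get_headline() and get_link() in the question, and it does not depend on how the tags happen to be serialised.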
Upvotes: 0
Reputation: 107687
Consider a dictionary comprehension:
newsdict = {v: {'Title': k,
                'Source': 'Reuters',
                'Thumbnail': 'http://logok.org/wp-content/uploads/2014/04/Reuters-logo.png'}
            for k, v in reuters_dictionary.items()}
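A self-contained sketch of that comprehension on two of the sample entries. In the question, reuters_dictionary only exists inside build_dict(), so this assumes the function has been changed to return it; the small literal below is a stand-in for that return value.

```python
# Stand-in for the title -> URL dictionary built in the question
reuters_dictionary = {
    'Trump strikes defiant tone ahead of debate':
        'http://www.reuters.com/article/us-usa-election-idUSKCN1290JZ',
    'Russia says U.S. actions threaten its national security':
        'http://www.reuters.com/article/us-russia-usa-idUSKCN1290DP',
}

source = 'Reuters'
thumbnail = 'http://logok.org/wp-content/uploads/2014/04/Reuters-logo.png'

# Flip each pair so the URL becomes the outer key
newsdict = {link: {'Title': title, 'Source': source, 'Thumbnail': thumbnail}
            for title, link in reuters_dictionary.items()}
```

Because the intermediate dictionary is keyed by title, two articles with the same headline would collapse into one entry before the comprehension runs; zipping the two lists directly avoids that.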
Upvotes: 1