Sam Russo

Reputation: 31

Beautiful Soup parsing inline <div> and <p> into dictionary

I am working on a messy portion of HTML containing store location data and am having a hard time parsing it cleanly. I have read a couple of other posts here but haven't gotten anything to work.

Below is a portion of the HTML, from the txt files:

"
                    ^ class=""location"">
                        <h2>
                            <a href=""/Locations/AL/5-Points-In-Line"">5 Points In-Line</a>
                        </h2>

                        <p>
                            2000 Highland Ave S
                            <br/>
                            Birmingham, AL 35205
                            <br/>
                            (205) 930-8000                        
                        </p>
                    </div>
                    ^ class=""location"">
                        <h2>

                            <a href=""/Locations/AL/Airport-Blvd-AL"">Airport Blvd (AL)</a>
                        </h2>

                        <p>
                            4707 Airport Blvd
                            <br/>Mobile, AL 36608
                                <br/>
(251) 461-9933                        </p>
                    </div>
                    ^ class=""location"">
                        <h2>

                            <a href=""/Locations/AL/Alabama-Power"">Alabama Power</a>
                        </h2>

                        <p>
                            600 18th St N
                            <br/>Birmingham, AL 35203
                                <br/>
(205) 257-1688                        </p>
                    </div>

What I need is a dictionary with the 'a' (location) as the key and the address information in 'p' as the value. The <br/> tags in the address portion are also a problem.

Ideally I would want:

{'5-Points-In-Line':['2000 Highland Ave S','Birmingham AL 35205','(205)930-8000'],...}

Here's what I have so far, which isn't close to that:

from bs4 import BeautifulSoup

#import Location Tables

with open('AL.txt','r') as f:
    contents = f.read()    
    soup = BeautifulSoup(contents, 'html.parser')

    result = {}

for div in soup.find_all('div'):
    
    for h in soup.find_all('h2'):
        location = h.find('a').text
        
        for p in soup.find_all('p'): 
            p = p.text.replace('\n','|').replace('\t','').strip()
            clean = ' '.join(p.split()).replace('| ','|').replace(' |','|').replace('||','|')
            address_clean = clean.replace('| ','|').replace(' |','|').replace('||','|')
            
            result[location].append[address_clean]
            
result

Getting KeyError: '5 Points In-Line'

I was referencing the similar post below, but I couldn't get it to work, and I'm thinking it's the files I'm having to parse.

Beautiful Soup parsing inline <div> and <p> into dictionary

Upvotes: 1

Views: 215

Answers (2)

HedgeHog

Reputation: 25073

Just to answer your question concerning separation and get slightly closer to your expected output, you can do the following.

While getting the text from the <p>, use an alternative separator instead of whitespace, then split() the string on it to generate a list with the separated info.

Example (dictionary comprehension)

{tag.get_text():tag.find_next('p').get_text(strip=True, separator='|').split('|') for tag in soup.select('h2  a')}

Output

{'5 Points In-Line': ['2000 Highland Ave S', 'Birmingham, AL 35205','(205) 930-8000'], 'Airport Blvd (AL)': ['4707 Airport Blvd', 'Mobile, AL 36608', '(251) 461-9933'], 'Alabama Power': ['600 18th St N', 'Birmingham, AL 35203', '(205) 257-1688']}
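If the key should be the slug from the link's href, as in the desired output in the question, rather than the link text, the same idea can be applied per location block. This is only a minimal sketch: it assumes the markup is well-formed <div class="location"> blocks, and the html sample and the parse_locations name are illustrative, not from the original post.

```python
from bs4 import BeautifulSoup

# Sample markup, assumed well-formed; trimmed from the question.
html = """
<div class="location">
    <h2>
        <a href="/Locations/AL/5-Points-In-Line">5 Points In-Line</a>
    </h2>
    <p>
        2000 Highland Ave S
        <br/>
        Birmingham, AL 35205
        <br/>
        (205) 930-8000
    </p>
</div>
<div class="location">
    <h2>
        <a href="/Locations/AL/Airport-Blvd-AL">Airport Blvd (AL)</a>
    </h2>
    <p>
        4707 Airport Blvd
        <br/>Mobile, AL 36608
            <br/>
(251) 461-9933    </p>
</div>
"""


def parse_locations(markup):
    """Map each location's href slug to its list of address lines."""
    soup = BeautifulSoup(markup, "html.parser")
    result = {}
    for div in soup.select("div.location"):
        link = div.select_one("h2 a")
        para = div.find("p")
        if link is None or para is None:
            continue  # skip blocks missing a link or an address
        # Last path segment of the href, e.g. "5-Points-In-Line"
        slug = link["href"].rstrip("/").rsplit("/", 1)[-1]
        # stripped_strings yields each non-empty text node, whitespace-trimmed
        result[slug] = list(para.stripped_strings)
    return result


locations = parse_locations(html)
print(locations)
```

Working inside each div keeps the link and its address paired even if a block is missing its <p>, which find_next('p') would not guard against.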

Upvotes: 0

MendelG

Reputation: 20038

You can use the find_next() method and add the values to a dict:

from bs4 import BeautifulSoup

# `html` is assumed to hold the markup from the question as a string
soup = BeautifulSoup(html, 'html.parser')


output = {}
for tag in soup.select('h2 a'):
    output.setdefault(tag.get_text(), []).append(tag.find_next('p').get_text(strip=True, separator=' '))
    
print(output)

Output:

{'5 Points In-Line': ['2000 Highland Ave S Birmingham, AL 35205 (205) 930-8000'], 'Airport Blvd (AL)': ['4707 Airport Blvd Mobile, AL 36608 (251) 461-9933'], 'Alabama Power': ['600 18th St N Birmingham, AL 35203 (205) 257-1688']}
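As for the KeyError in the question: result[location] is looked up before any list has been stored under that key, and .append[...] uses square brackets where a call is needed. setdefault() as used above avoids the first problem. A small sketch of the difference, using made-up pairs rather than the parsed file (the pairs data is illustrative only):

```python
# Illustrative (name, address line) pairs; not parsed from the real file.
pairs = [
    ("Alabama Power", "600 18th St N"),
    ("Alabama Power", "Birmingham, AL 35203"),
    ("Alabama Power", "(205) 257-1688"),
]

result = {}
error_key = None
try:
    # Indexing a missing key raises KeyError before append is ever reached.
    result["Alabama Power"].append("600 18th St N")
except KeyError as exc:
    error_key = exc.args[0]

# setdefault() stores an empty list the first time a key is seen,
# then returns the stored list so append() can be called on it.
for name, line in pairs:
    result.setdefault(name, []).append(line)

print(error_key)
print(result)
```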

Upvotes: 1
