Reputation: 31
I am working on a messy portion of HTML for store location data and having a hard time parsing it cleanly. I have read a couple of other posts here but haven't gotten anything to work.
Below is a portion of the HTML, from the txt files:
"
<div class=""location"">
<h2>
<a href=""/Locations/AL/5-Points-In-Line"">5 Points In-Line</a>
</h2>
<p>
2000 Highland Ave S
<br/>
Birmingham, AL 35205
<br/>
(205) 930-8000
</p>
</div>
<div class=""location"">
<h2>
<a href=""/Locations/AL/Airport-Blvd-AL"">Airport Blvd (AL)</a>
</h2>
<p>
4707 Airport Blvd
<br/>Mobile, AL 36608
<br/>
(251) 461-9933 </p>
</div>
<div class=""location"">
<h2>
<a href=""/Locations/AL/Alabama-Power"">Alabama Power</a>
</h2>
<p>
600 18th St N
<br/>Birmingham, AL 35203
<br/>
(205) 257-1688 </p>
</div>
What I need is a dictionary with the 'a' (location) as the key and the address information in 'p' as the value. The <br/> tags in the address portion are also a problem.
Ideally I would want:
{'5-Points-In-Line':['2000 Highland Ave S','Birmingham AL 35205','(205)930-8000'],...}
Here's what I have so far, which isn't close to that:
from bs4 import BeautifulSoup
#import Location Tables
with open('AL.txt','r') as f:
    contents = f.read()
soup = BeautifulSoup(contents, 'html.parser')
result = {}
for div in soup.find_all('div'):
    for h in soup.find_all('h2'):
        location = h.find('a').text
        for p in soup.find_all('p'):
            p = p.text.replace('\n','|').replace('\t','').strip()
            clean = ' '.join(p.split()).replace('| ','|').replace(' |','|').replace('||','|')
            address_clean = clean.replace('| ','|').replace(' |','|').replace('||','|')
            result[location].append[address_clean]
result
Getting KeyError: '5 Points In-Line'
I was referencing the similar post below, but I couldn't get it to work, and I'm thinking it's the files I'm having to parse.
Beautiful Soup parsing inline <div> and <p> into dictionary
Upvotes: 1
Views: 215
Reputation: 25073
Just to answer your question concerning separation and get slightly closer to your expected output, you can do the following.
While getting the text from the <p>, use an alternative separator instead of whitespace and split() the string to generate a list with the separated info, all inside a dictionary comprehension:
{tag.get_text():tag.find_next('p').get_text(strip=True, separator='|').split('|') for tag in soup.select('h2 a')}
{'5 Points In-Line': ['2000 Highland Ave S', 'Birmingham, AL 35205','(205) 930-8000'], 'Airport Blvd (AL)': ['4707 Airport Blvd', 'Mobile, AL 36608', '(251) 461-9933'], 'Alabama Power': ['600 18th St N', 'Birmingham, AL 35203', '(205) 257-1688']}
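The question asked for keys like '5-Points-In-Line', which match the last path segment of each link's href rather than the link text. A minimal sketch of that variation follows. It assumes the doubled quotes in the pasted file are an escaping artifact that can be collapsed with a plain string replace (this would also collapse any genuinely empty attribute value), and the names `raw` and `by_slug` are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical sample mirroring the file content shown in the question,
# including the doubled quotes around attribute values.
raw = '''
<div class=""location"">
<h2>
<a href=""/Locations/AL/5-Points-In-Line"">5 Points In-Line</a>
</h2>
<p>
2000 Highland Ave S
<br/>
Birmingham, AL 35205
<br/>
(205) 930-8000
</p>
</div>
<div class=""location"">
<h2>
<a href=""/Locations/AL/Airport-Blvd-AL"">Airport Blvd (AL)</a>
</h2>
<p>
4707 Airport Blvd
<br/>Mobile, AL 36608
<br/>
(251) 461-9933 </p>
</div>
'''

# Assumption: every "" is an escaped ", so collapse them before parsing.
html = raw.replace('""', '"')
soup = BeautifulSoup(html, 'html.parser')

# Key on the last segment of the href; value is the list of address lines.
by_slug = {
    tag['href'].rsplit('/', 1)[-1]:
        tag.find_next('p').get_text(strip=True, separator='|').split('|')
    for tag in soup.select('h2 a')
}
print(by_slug)
```

Without the replace step, html.parser would likely read href as an empty string, so the slug lookup depends on that normalization.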
Upvotes: 0
Reputation: 20038
You can use find_next() and add the values to a dict:
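from bs4 import BeautifulSoup
# html is assumed to hold the page source, e.g. the contents of AL.txt
soup = BeautifulSoup(html, 'html.parser')
output = {}
for tag in soup.select('h2 a'):
    # setdefault creates the list on first use, so no KeyError is raised
    output.setdefault(tag.get_text(), []).append(tag.find_next('p').get_text(strip=True, separator=' '))
print(output)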
Output:
{'5 Points In-Line': ['2000 Highland Ave S Birmingham, AL 35205 (205) 930-8000'], 'Airport Blvd (AL)': ['4707 Airport Blvd Mobile, AL 36608 (251) 461-9933'], 'Alabama Power': ['600 18th St N Birmingham, AL 35203 (205) 257-1688']}
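The KeyError in the question comes from indexing result[location] before that key exists, and the nested loops there search the whole soup rather than the current div. As an alternative sketch, assuming each entry sits in its own div with class location and normally quoted attributes, the lookup can be scoped to each div and the address lines collected with stripped_strings. The sample `html` string and the name `locations` are hypothetical:

```python
from bs4 import BeautifulSoup

# Hypothetical one-entry sample with ordinary attribute quoting.
html = (
    '<div class="location"><h2>'
    '<a href="/Locations/AL/Alabama-Power">Alabama Power</a></h2>'
    '<p>600 18th St N<br/>Birmingham, AL 35203<br/>(205) 257-1688 </p></div>'
)

soup = BeautifulSoup(html, 'html.parser')

locations = {}
for div in soup.select('div.location'):
    # Search inside this div only, so each name pairs with its own address.
    link = div.select_one('h2 a')
    para = div.find('p')
    if link is None or para is None:
        continue  # skip blocks that do not have the expected shape
    # stripped_strings yields each non-empty text piece between the <br/> tags.
    locations[link.get_text(strip=True)] = list(para.stripped_strings)

print(locations)
```

Assigning the list directly avoids both the missing-key problem and the need to join and re-split the address text.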
Upvotes: 1