Reputation: 73
I have a list created with BeautifulSoup's findAll, and its first element contains the information below. From this element I only need the address, which is
"54000 NANCY 47 RUE SERGENT BLANDAN". How can I extract it?
{
"div": {
"@class": "result-left",
"h3": "Establishment(s)",
"div": [
{
"label": "Status:",
"#text": "Closed"
},
{
"p": {
"label": "Brand name:",
"#text": "LE ZODIAC"
}
},
{
"p": {
"label": "Usual name:"
}
},
{
"p": {
"label": "Address:",
"br": [
"",
"54000\r\n\t\t\t\t\t\t\t\t\t\t\tNANCY"
],
"#text": "47\r\n\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\tRUE\r\n\t\t\t\t\t\t\t\t\t\tSERGENT BLANDAN"
}
},
{
"p": {
"label": "Principal activity:",
"#text": "47.78C - \r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\tAutres commerces de détail spécialisés divers"
}
},
{
"p": {
"label": {
"sup": "*",
"#text": [
"Employee numbers",
":"
]
}
}
},
{
"p": {
"label": "Year employee numbers verified:"
}
}
]
}
}
Upvotes: 0
Views: 40
Reputation: 84465
After extracting the items of interest, you can clean up the resulting string with re. Note that the indexing below is specific to the JSON you posted:
import re
s = {
"div": {
"@class": "result-left",
"h3": "Establishment(s)",
"div": [
{
"label": "Status:",
"#text": "Closed"
},
{
"p": {
"label": "Brand name:",
"#text": "LE ZODIAC"
}
},
{
"p": {
"label": "Usual name:"
}
},
{
"p": {
"label": "Address:",
"br": [
"",
"54000\r\n\t\t\t\t\t\t\t\t\t\t\tNANCY"
],
"#text": "47\r\n\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\tRUE\r\n\t\t\t\t\t\t\t\t\t\tSERGENT BLANDAN"
}
},
{
"p": {
"label": "Principal activity:",
"#text": "47.78C - \r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\tAutres commerces de détail spécialisés divers"
}
},
{
"p": {
"label": {
"sup": "*",
"#text": [
"Employee numbers",
":"
]
}
}
},
{
"p": {
"label": "Year employee numbers verified:"
}
}
]
}
}
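p = s['div']['div'][3]['p']  # the "Address:" paragraph
# \s+ collapses every run of whitespace (CR, LF, tabs) into one space;
# replacing each '\r\n\t+' separately would leave a double space after "47"
result = re.sub(r'\s+', ' ', ' '.join([p['br'][1], p['#text']])).strip()
print(result)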
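The hard-coded index 3 stops working if the entries ever appear in a different order. A minimal sketch, assuming the dictionary keeps the shape shown above and that "br" is a list, that looks the entry up by its label instead. The name find_address and the trimmed-down sample are hypothetical, for illustration only:

```python
def find_address(data: dict):
    """Return the cleaned address from a dict shaped like the one above, or None."""
    for entry in data['div']['div']:
        p = entry.get('p')
        # Skip entries without a "p" dict or with a different label.
        if not isinstance(p, dict) or p.get('label') != 'Address:':
            continue
        # Postcode and city sit in the "br" list, the street in "#text".
        parts = [x for x in p.get('br', []) if x] + [p.get('#text', '')]
        # split() with no arguments splits on any run of whitespace.
        return ' '.join(' '.join(parts).split())
    return None

# Trimmed-down sample with the same shape as the data in the question.
sample = {
    "div": {
        "div": [
            {"label": "Status:", "#text": "Closed"},
            {"p": {"label": "Brand name:", "#text": "LE ZODIAC"}},
            {"p": {
                "label": "Address:",
                "br": ["", "54000\r\n\t\t\tNANCY"],
                "#text": "47\r\n\t\t\r\n\t\tRUE\r\n\t\tSERGENT BLANDAN",
            }},
        ]
    }
}
print(find_address(sample))  # 54000 NANCY 47 RUE SERGENT BLANDAN
```

Returning None when no "Address:" entry exists lets the caller decide how to handle elements that lack an address.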
Upvotes: 1
Reputation: 20450
Having lots of repeated tabs, CRLFs, and other whitespace doesn't seem very convenient. It would be worth your while to define this function:
def simplify_ws(s: str):
    """Coalesces multiple whitespace, e.g. 'a   b \t c' --> 'a b c'."""
    return ' '.join(s.split())
Your dictionary is nice and quite complete, so it could certainly be used for a solution. But it would be more convenient to have bs4 iterate over just your favorite paragraphs:
for p in soup.find_all('p'):
    txt = p.get_text()
    if 'Address:' in txt:
        print(simplify_ws(txt))
You may want to do some more filtering and munging on top of that.
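One example of such munging is dropping the "Address:" label itself from the text. A minimal sketch using only string methods: strip_label is a hypothetical helper, and the raw string is a made-up approximation of what get_text() might return for that paragraph, not output captured from the real page:

```python
def simplify_ws(s: str) -> str:
    """Coalesce every run of whitespace into a single space."""
    return ' '.join(s.split())

def strip_label(text: str, label: str = 'Address:') -> str:
    """Drop everything up to and including the label, then tidy whitespace."""
    before, found, after = text.partition(label)
    # If the label is absent, partition() leaves the whole string in `before`.
    return simplify_ws(after if found else before)

# Assumed shape of the paragraph text; the real order follows the HTML.
raw = "Address:\r\n\t\t47\r\n\t\tRUE\r\n\t\tSERGENT BLANDAN\r\n\t\t54000\r\n\t\tNANCY"
print(strip_label(raw))  # 47 RUE SERGENT BLANDAN 54000 NANCY
```

The parts come out in document order, so if the postcode should come first you would still need to reorder them.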
Upvotes: 0