user229519

Reputation: 73

Web scraping with Python: extracting a specific section

I have a list created with BeautifulSoup's findAll, and its first element contains the information below. From it I only need the address, which is
"54000 NANCY 47 RUE SERGENT BLANDAN". How can I extract it?

{
  "div": {
    "@class": "result-left",
    "h3": "Establishment(s)",
    "div": [
      {
        "label": "Status:",
        "#text": "Closed"
      },
      {
        "p": {
          "label": "Brand name:",
          "#text": "LE ZODIAC"
        }
      },
      {
        "p": {
          "label": "Usual name:"
        }
      },
      {
        "p": {
          "label": "Address:",
          "br": [
            "",
            "54000\r\n\t\t\t\t\t\t\t\t\t\t\tNANCY"
          ],
          "#text": "47\r\n\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\tRUE\r\n\t\t\t\t\t\t\t\t\t\tSERGENT BLANDAN"
        }
      },
      {
        "p": {
          "label": "Principal activity:",
          "#text": "47.78C - \r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\tAutres commerces de détail spécialisés divers"
        }
      },
      {
        "p": {
          "label": {
            "sup": "*",
            "#text": [
              "Employee numbers",
              ":"
            ]
          }
        }
      },
      {
        "p": {
          "label": "Year employee numbers verified:"
        }
      }
    ]
  }
}

Upvotes: 0

Views: 40

Answers (2)

QHarr

Reputation: 84465

After extracting the items of interest, you can use re to clean up the whitespace. Note that this is specific to the JSON you posted:

import re

s = {
  "div": {
    "@class": "result-left",
    "h3": "Establishment(s)",
    "div": [
      {
        "label": "Status:",
        "#text": "Closed"
      },
      {
        "p": {
          "label": "Brand name:",
          "#text": "LE ZODIAC"
        }
      },
      {
        "p": {
          "label": "Usual name:"
        }
      },
      {
        "p": {
          "label": "Address:",
          "br": [
            "",
            "54000\r\n\t\t\t\t\t\t\t\t\t\t\tNANCY"
          ],
          "#text": "47\r\n\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\tRUE\r\n\t\t\t\t\t\t\t\t\t\tSERGENT BLANDAN"
        }
      },
      {
        "p": {
          "label": "Principal activity:",
          "#text": "47.78C - \r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\tAutres commerces de détail spécialisés divers"
        }
      },
      {
        "p": {
          "label": {
            "sup": "*",
            "#text": [
              "Employee numbers",
              ":"
            ]
          }
        }
      },
      {
        "p": {
          "label": "Year employee numbers verified:"
        }
      }
    ]
  }
}

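# The address is the fourth entry: postcode/city is the second "br" item, the street is in "#text".
address = s['div']['div'][3]['p']
# Collapse every run of whitespace (CRLFs, tabs, spaces) into a single space.
result = re.sub(r'\s+', ' ', ' '.join([address['br'][1], address['#text']])).strip()
print(result)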
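
The snippet above relies on the address being at index 3. As a hedged alternative, here is a minimal sketch that searches for the entry whose label is "Address:" instead. The helper name `find_address` and the trimmed-down `sample` record are illustrative assumptions, not part of the original answer.

```python
import re
from typing import Optional


def find_address(record: dict) -> Optional[str]:
    """Return the cleaned address from a record shaped like the question's, or None."""
    for entry in record["div"]["div"]:
        p = entry.get("p")
        # Skip entries without a "p" dict, or whose label is not the plain string "Address:".
        if not isinstance(p, dict) or p.get("label") != "Address:":
            continue
        # Postcode and city sit in the "br" list (empty items dropped); the street is in "#text".
        parts = [part for part in p.get("br", []) if part] + [p.get("#text", "")]
        # Collapse CRLFs, tabs and repeated spaces into single spaces.
        return re.sub(r"\s+", " ", " ".join(parts)).strip()
    return None


# Trimmed-down copy of the record from the question, for illustration only.
sample = {
    "div": {
        "div": [
            {"label": "Status:", "#text": "Closed"},
            {"p": {"label": "Brand name:", "#text": "LE ZODIAC"}},
            {"p": {
                "label": "Address:",
                "br": ["", "54000\r\n\t\t\tNANCY"],
                "#text": "47\r\n\t\t\r\n\t\tRUE\r\n\t\tSERGENT BLANDAN",
            }},
        ]
    }
}

print(find_address(sample))  # 54000 NANCY 47 RUE SERGENT BLANDAN
```

Looking the entry up by label should keep working if the page adds or drops a field before the address, at the cost of a short loop.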

Upvotes: 1

J_H

Reputation: 20450

Having lots of repeated tabs, CRLFs, and other whitespace doesn't seem very convenient. It would be worth your while to define this function:

def simplify_ws(s: str) -> str:
    """Coalesces multiple whitespace, e.g. 'a   b c' --> 'a b c'."""
    return ' '.join(s.split())
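
To illustrate what this helper does with the data from the question, here is a small sketch that applies it to the two raw address strings. The variable names `raw_city` and `raw_street` are assumptions made for the example.

```python
def simplify_ws(s: str) -> str:
    """Coalesces multiple whitespace, e.g. 'a   b c' --> 'a b c'."""
    return " ".join(s.split())


# Raw values copied from the "Address:" entry in the question.
raw_city = "54000\r\n\t\t\t\t\t\t\t\t\t\t\tNANCY"
raw_street = (
    "47\r\n\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t"
    "RUE\r\n\t\t\t\t\t\t\t\t\t\tSERGENT BLANDAN"
)

# str.split() with no arguments splits on any whitespace run, including \r, \n and \t.
address = simplify_ws(f"{raw_city} {raw_street}")
print(address)  # 54000 NANCY 47 RUE SERGENT BLANDAN
```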

Your dictionary is nice and quite complete, so it could certainly be used for a solution. But it would be more convenient to have bs4 iterate over just your favorite paragraphs:

# soup is the BeautifulSoup object for the parsed page.
for p in soup.find_all('p'):
    txt = p.get_text()
    if 'Address:' in txt:
        print(simplify_ws(txt))

You may want to do some more filtering and munging on top of that.
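
As one possible version of that extra filtering, here is a self-contained sketch that matches the label exactly and drops it from the output. The HTML is a guess reconstructed from the dictionary in the question (the real page markup is not shown), and `extract_address` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an assumption based on the question's dictionary, not the real page.
html = """
<div class="result-left">
  <h3>Establishment(s)</h3>
  <div><p><label>Brand name:</label> LE ZODIAC</p></div>
  <div><p><label>Address:</label>
      47
      RUE
      SERGENT BLANDAN
      <br/><br/>54000
      NANCY</p></div>
</div>
"""


def extract_address(soup):
    """Return the whitespace-normalised text of the 'Address:' paragraph, or None."""
    for p in soup.find_all("p"):
        label = p.find("label")
        # Only accept paragraphs whose label text is exactly "Address:".
        if label is None or label.get_text(strip=True) != "Address:":
            continue
        # Remove the label from the tree so only the address value is left.
        label.extract()
        # Join the remaining text nodes and collapse all whitespace runs.
        return " ".join(p.get_text(" ").split())
    return None


soup = BeautifulSoup(html, "html.parser")
address = extract_address(soup)
print(address)  # 47 RUE SERGENT BLANDAN 54000 NANCY
```

The parts come out in document order, so with this assumed markup the street precedes the postcode; reorder them afterwards if you need "54000 NANCY" first.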

Upvotes: 0
