jdwatermann

Reputation: 21

Pulling all locations from a store locator using python and HTML

I am looking to pull all store locations from a company website using Python for a class project. I am able to get the HTML data but am struggling on how to only pull the json items from it that are found in the variable "markersData".

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup

url = "https://locator.takeuchi-us.com/"
html = urllib.request.urlopen(url)

# Read the response body once; a second read() on the same response returns b""
data = html.read().decode()
soup = BeautifulSoup(data, "html.parser")

print(data)

Looking to pull all data found in the variable "markersData" (ID, title, lat, lng, addship, city, state, zip, phone)


Abridged source code snippet from the target page:

<body data-theme="a" data-content-theme="a">
    <!-- ... -->
    <script type="text/javascript">
        var markersData = [                
                    {
                        "id": '0',
                        "title": 'Al Preston&#39;s Garage, Inc.',
                        "lat": '41.324549',
                        "lng": '-73.106124',
                        "addship": '810 Howe Avenue',
                        "city": 'Shelton',
                        "state": 'CT',
                        "zip": '06484',
                        "phone": '(203) 924-1747',
                        "fax": '(203) 924-4594',
                        "machinetype": 'Sales, Service & Parts for:<div data-inline="true" class="dealerExcavator" title="Available at this dealer"></div><div data-inline="true" class="dealerTrackLoader" title="Available at this dealer"></div><div data-inline="true" class="dealerWheelLoader" title="Available at this dealer"></div>',
                        "website": 'http://takeuchi-us.alprestonsequipment.com/home_takeuchi-us.asp',
                        "description": 
                        '<div class="popContainer">' +
                            '<div class="popTitle">Al Preston&#39;s Garage, Inc.</div>' +
                            '<div class="popBody">' +
                                '<span class="popBodyText">810 Howe Avenue</span><br />' +
                                '<span class="popBodyText">Shelton, CT 06484</span><br />' +
                                '<span class="popBodyText">Phone: (203) 924-1747</span><br />' +
                                '<span class="popBodyText">Fax: (203) 924-4594</span><br />' +
                                '<span class="popBodyText">Sales, Service & Parts for:<div data-inline="true" class="dealerExcavator" title="Available at this dealer"></div><div data-inline="true" class="dealerTrackLoader" title="Available at this dealer"></div><div data-inline="true" class="dealerWheelLoader" title="Available at this dealer"></div></span>' +
                                '<a href="http://takeuchi-us.alprestonsequipment.com/home_takeuchi-us.asp" target="_blank" class="popButton">Dealer Website</a>' +
                                '<a href="javascript:void(0)" onClick="directions(\'41.324549\', \'-73.106124\');" data-lat="41.324549" data-lng="-73.106124" class="popButton cs_ml_5">Dealer Directions</a>' +
                            '</div>' +
                       '</div>'
                    }
            ];
    </script>
</body>

Upvotes: 0

Views: 903

Answers (1)

Tomalak

Reputation: 338326

struggling on how to only pull the json items from it that are found in the variable "markersData"

There is no JSON in that page. There is JavaScript - looks similar, but is not the same thing at all.
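A quick way to see the difference, as a minimal sketch: the fragment below imitates the style of the page's script (double-quoted keys, single-quoted values) and is only an illustration, not the real page content. Python's json module rejects it, while the same data written as real JSON parses fine.

```python
import json

# Illustrative fragment in the style of the page: single-quoted string values
js_fragment = """{"id": '0', "city": 'Shelton'}"""

try:
    json.loads(js_fragment)
    parsed_as_json = True
except json.JSONDecodeError as err:
    # JSON only allows double-quoted strings, so the parser gives up here
    parsed_as_json = False
    print("not JSON:", err)

# The same data written as real JSON
json_fragment = '{"id": "0", "city": "Shelton"}'
record = json.loads(json_fragment)
print(record["city"])
```

String concatenation with +, as used in the "description" property, is another thing JSON has no notion of.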

You can get the JavaScript source code by reading the text content of the <script type="text/javascript"> element - since there are multiple script tags on the page, you should pick the one that contains the string "markersData".

This is easy enough to do in BeautifulSoup and I'm not going to post code for that.
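For readers who want a starting point anyway, here is a rough sketch of that step. To keep it runnable without extra packages it uses the standard library's html.parser instead of BeautifulSoup, and the ScriptCollector class and the sample_html string are made up for illustration. With BeautifulSoup the same idea would be to loop over soup.find_all("script") and keep the tag whose text contains "markersData".

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collects the text content of every <script> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script content may arrive in several chunks, so append to the current entry
        if self._in_script:
            self.scripts[-1] += data


# Stand-in for the downloaded page; the real page has more script tags
sample_html = """
<body>
  <script type="text/javascript">var other = 1;</script>
  <script type="text/javascript">var markersData = [ { "id": '0' } ];</script>
</body>
"""

collector = ScriptCollector()
collector.feed(sample_html)
collector.close()

# Pick the one script whose source mentions the variable we are after
js_code = next(s for s in collector.scripts if "markersData" in s)
print(js_code)
```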

The more complex part of the task is to make sense of the JS source code - a JSON parser will not help here, so we have to use a JavaScript parser and then extract the variable markersData from its output.

Assume this is the source code string we have extracted using BeautifulSoup (example taken verbatim from the source code of your target page):

        var markersData = [

                    {
                        "id": '0',
                        "title": 'Al Preston&#39;s Garage, Inc.',
                        "lat": '41.324549',
                        "lng": '-73.106124',
                        "addship": '810 Howe Avenue',
                        "city": 'Shelton',
                        "state": 'CT',
                        "zip": '06484',
                        "phone": '(203) 924-1747',
                        "fax": '(203) 924-4594',
                        "machinetype": 'Sales, Service & Parts for:<div data-inline="true" class="dealerExcavator" title="Available at this dealer"></div><div data-inline="true" class="dealerTrackLoader" title="Available at this dealer"></div><div data-inline="true" class="dealerWheelLoader" title="Available at this dealer"></div>',
                        "website": 'http://takeuchi-us.alprestonsequipment.com/home_takeuchi-us.asp',
                        "description": 
                        '<div class="popContainer">' +
                            '<div class="popTitle">Al Preston&#39;s Garage, Inc.</div>' +
                            '<div class="popBody">' +
                                '<span class="popBodyText">810 Howe Avenue</span><br />' +
                                '<span class="popBodyText">Shelton, CT 06484</span><br />' +
                                '<span class="popBodyText">Phone: (203) 924-1747</span><br />' +
                                '<span class="popBodyText">Fax: (203) 924-4594</span><br />' +
                                '<span class="popBodyText">Sales, Service & Parts for:<div data-inline="true" class="dealerExcavator" title="Available at this dealer"></div><div data-inline="true" class="dealerTrackLoader" title="Available at this dealer"></div><div data-inline="true" class="dealerWheelLoader" title="Available at this dealer"></div></span>' +
                                '<a href="http://takeuchi-us.alprestonsequipment.com/home_takeuchi-us.asp" target="_blank" class="popButton">Dealer Website</a>' +
                                '<a href="javascript:void(0)" onClick="directions(\'41.324549\', \'-73.106124\');" data-lat="41.324549" data-lng="-73.106124" class="popButton cs_ml_5">Dealer Directions</a>' +
                            '</div>' +
                       '</div>'
                    }
        ];

One of the Python tools that can parse JavaScript is pyjsparser, which is based on the fast and feature-complete JS parser esprima. It will turn JS source code into what is called an "Abstract Syntax Tree" - very much like an HTML DOM, but with different node types.

from pyjsparser import parse

js_code = """ the sample JS code above """
program = parse(js_code)

The resulting syntax tree is made up of nested dicts and lists. To see what it looks like, we dump the tree in a human-readable format. The json module does a much better job at this than Python's native pretty-printer pprint:

import json
print(json.dumps(program, indent='  '))

Result:

{
  "type": "Program",
  "body": [
    {
      "type": "VariableDeclaration",
      "declarations": [
        {
          "type": "VariableDeclarator",
          "id": {
            "type": "Identifier",
            "name": "markersData"
          },
          "init": {
            "type": "ArrayExpression",
            "elements": [
              {
                "type": "ObjectExpression",
                "properties": [
                  {
                    "type": "Property",
                    "key": {
                      "type": "Literal",
                      "value": "id",
                      "raw": "\"id\""
                    },
                    "computed": false,
                    "value": {
                      "type": "Literal",
                      "value": "0",
                      "raw": "'0'"
                    },
                    "kind": "init",
                    "method": false,
                    "shorthand": false
                  },
                  {
                    "type": "Property",
                    "key": {
                      "type": "Literal",
                      "value": "title",
                      "raw": "\"title\""
                    },
                    "computed": false,
                    "value": {
                      "type": "Literal",
                      "value": "Al Preston&#39;s Garage, Inc.",
                      "raw": "'Al Preston&#39;s Garage, Inc.'"
                    },
                    "kind": "init",
                    "method": false,
                    "shorthand": false
                  },
                  {
                    "type": "Property",
                    "key": {
                      "type": "Literal",
                      "value": "lat",
                      "raw": "\"lat\""
                    },
                    "computed": false,
                    "value": {
                      "type": "Literal",
                      "value": "41.324549",
                      "raw": "'41.324549'"
                    },
                    "kind": "init",
                    "method": false,
                    "shorthand": false
                  },
                  {
                    "type": "Property",
                    "key": {
                      "type": "Literal",
                      "value": "lng",
                      "raw": "\"lng\""
                    },
                    "computed": false,
                    "value": {
                      "type": "Literal",
                      "value": "-73.106124",
                      "raw": "'-73.106124'"
                    },
                    "kind": "init",
                    "method": false,
                    "shorthand": false
                  },
                  {
                    "type": "Property",
                    "key": {
                      "type": "Literal",
                      "value": "addship",
                      "raw": "\"addship\""
                    },
                    "computed": false,
                    "value": {
                      "type": "Literal",
                      "value": "810 Howe Avenue",
                      "raw": "'810 Howe Avenue'"
                    },
                    "kind": "init",
                    "method": false,
                    "shorthand": false
                  },
                  {
                    "type": "Property",
                    "key": {
                      "type": "Literal",
                      "value": "city",
                      "raw": "\"city\""
                    },
                    "computed": false,
                    "value": {
                      "type": "Literal",
                      "value": "Shelton",
                      "raw": "'Shelton'"
                    },
                    "kind": "init",
                    "method": false,
                    "shorthand": false
                  },
                  {
                    "type": "Property",
                    "key": {
                      "type": "Literal",
                      "value": "state",
                      "raw": "\"state\""
                    },
                    "computed": false,
                    "value": {
                      "type": "Literal",
                      "value": "CT",
                      "raw": "'CT'"
                    },
                    "kind": "init",
                    "method": false,
                    "shorthand": false
                  },
                  {
                    "type": "Property",
                    "key": {
                      "type": "Literal",
                      "value": "zip",
                      "raw": "\"zip\""
                    },
                    "computed": false,
                    "value": {
                      "type": "Literal",
                      "value": "06484",
                      "raw": "'06484'"
                    },
                    "kind": "init",
                    "method": false,
                    "shorthand": false
                  },
                  {
                    "type": "Property",
                    "key": {
                      "type": "Literal",
                      "value": "phone",
                      "raw": "\"phone\""
                    },
                    "computed": false,
                    "value": {
                      "type": "Literal",
                      "value": "(203) 924-1747",
                      "raw": "'(203) 924-1747'"
                    },
                    "kind": "init",
                    "method": false,
                    "shorthand": false
                  },
                  {
                    "type": "Property",
                    "key": {
                      "type": "Literal",
                      "value": "fax",
                      "raw": "\"fax\""
                    },
                    "computed": false,
                    "value": {
                      "type": "Literal",
                      "value": "(203) 924-4594",
                      "raw": "'(203) 924-4594'"
                    },
                    "kind": "init",
                    "method": false,
                    "shorthand": false
                  },
                  {
                    "type": "Property",
                    "key": {
                      "type": "Literal",
                      "value": "machinetype",
                      "raw": "\"machinetype\""
                    },
                    "computed": false,
                    "value": {
                      "type": "Literal",
                      "value": "Sales, Service & Parts for:<div data-inline=\"true\" class=\"dealerExcavator\" title=\"Available at this dealer\"></div><div data-inline=\"true\" class=\"dealerTrackLoader\" title=\"Available at this dealer\"></div><div data-inline=\"true\" class=\"dealerWheelLoader\" title=\"Available at this dealer\"></div>",
                      "raw": "'Sales, Service & Parts for:<div data-inline=\"true\" class=\"dealerExcavator\" title=\"Available at this dealer\"></div><div data-inline=\"true\" class=\"dealerTrackLoader\" title=\"Available at this dealer\"></div><div data-inline=\"true\" class=\"dealerWheelLoader\" title=\"Available at this dealer\"></div>'"
                    },
                    "kind": "init",
                    "method": false,
                    "shorthand": false
                  },
                  {
                    "type": "Property",
                    "key": {
                      "type": "Literal",
                      "value": "website",
                      "raw": "\"website\""
                    },
                    "computed": false,
                    "value": {
                      "type": "Literal",
                      "value": "http://takeuchi-us.alprestonsequipment.com/home_takeuchi-us.asp",
                      "raw": "'http://takeuchi-us.alprestonsequipment.com/home_takeuchi-us.asp'"
                    },
                    "kind": "init",
                    "method": false,
                    "shorthand": false
                  },
                  {
                    "type": "Property",
                    "key": {
                      "type": "Literal",
                      "value": "description",
                      "raw": "\"description\""
                    },
                    "computed": false,
                    "value": {
                      "type": "BinaryExpression",
                      "operator": "+",
                      "left": {
                        "type": "BinaryExpression",
                        "operator": "+",
                        "left": {
                          "type": "BinaryExpression",
                          "operator": "+",
                          "left": {
                            "type": "BinaryExpression",
                            "operator": "+",
                            "left": {
                              "type": "BinaryExpression",
                              "operator": "+",
                              "left": {
                                "type": "BinaryExpression",
                                "operator": "+",
                                "left": {
                                  "type": "BinaryExpression",
                                  "operator": "+",
                                  "left": {
                                    "type": "BinaryExpression",
                                    "operator": "+",
                                    "left": {
                                      "type": "BinaryExpression",
                                      "operator": "+",
                                      "left": {
                                        "type": "BinaryExpression",
                                        "operator": "+",
                                        "left": {
                                          "type": "BinaryExpression",
                                          "operator": "+",
                                          "left": {
                                            "type": "Literal",
                                            "value": "<div class=\"popContainer\">",
                                            "raw": "'<div class=\"popContainer\">'"
                                          },
                                          "right": {
                                            "type": "Literal",
                                            "value": "<div class=\"popTitle\">Al Preston&#39;s Garage, Inc.</div>",
                                            "raw": "'<div class=\"popTitle\">Al Preston&#39;s Garage, Inc.</div>'"
                                          }
                                        },
                                        "right": {
                                          "type": "Literal",
                                          "value": "<div class=\"popBody\">",
                                          "raw": "'<div class=\"popBody\">'"
                                        }
                                      },
                                      "right": {
                                        "type": "Literal",
                                        "value": "<span class=\"popBodyText\">810 Howe Avenue</span><br />",
                                        "raw": "'<span class=\"popBodyText\">810 Howe Avenue</span><br />'"
                                      }
                                    },
                                    "right": {
                                      "type": "Literal",
                                      "value": "<span class=\"popBodyText\">Shelton, CT 06484</span><br />",
                                      "raw": "'<span class=\"popBodyText\">Shelton, CT 06484</span><br />'"
                                    }
                                  },
                                  "right": {
                                    "type": "Literal",
                                    "value": "<span class=\"popBodyText\">Phone: (203) 924-1747</span><br />",
                                    "raw": "'<span class=\"popBodyText\">Phone: (203) 924-1747</span><br />'"
                                  }
                                },
                                "right": {
                                  "type": "Literal",
                                  "value": "<span class=\"popBodyText\">Fax: (203) 924-4594</span><br />",
                                  "raw": "'<span class=\"popBodyText\">Fax: (203) 924-4594</span><br />'"
                                }
                              },
                              "right": {
                                "type": "Literal",
                                "value": "<span class=\"popBodyText\">Sales, Service & Parts for:<div data-inline=\"true\" class=\"dealerExcavator\" title=\"Available at this dealer\"></div><div data-inline=\"true\" class=\"dealerTrackLoader\" title=\"Available at this dealer\"></div><div data-inline=\"true\" class=\"dealerWheelLoader\" title=\"Available at this dealer\"></div></span>",
                                "raw": "'<span class=\"popBodyText\">Sales, Service & Parts for:<div data-inline=\"true\" class=\"dealerExcavator\" title=\"Available at this dealer\"></div><div data-inline=\"true\" class=\"dealerTrackLoader\" title=\"Available at this dealer\"></div><div data-inline=\"true\" class=\"dealerWheelLoader\" title=\"Available at this dealer\"></div></span>'"
                              }
                            },
                            "right": {
                              "type": "Literal",
                              "value": "<a href=\"http://takeuchi-us.alprestonsequipment.com/home_takeuchi-us.asp\" target=\"_blank\" class=\"popButton\">Dealer Website</a>",
                              "raw": "'<a href=\"http://takeuchi-us.alprestonsequipment.com/home_takeuchi-us.asp\" target=\"_blank\" class=\"popButton\">Dealer Website</a>'"
                            }
                          },
                          "right": {
                            "type": "Literal",
                            "value": "<a href=\"javascript:void(0)\" onClick=\"directions('41.324549', '-73.106124');\" data-lat=\"41.324549\" data-lng=\"-73.106124\" class=\"popButton cs_ml_5\">Dealer Directions</a>",
                            "raw": "'<a href=\"javascript:void(0)\" onClick=\"directions(\\'41.324549\\', \\'-73.106124\\');\" data-lat=\"41.324549\" data-lng=\"-73.106124\" class=\"popButton cs_ml_5\">Dealer Directions</a>'"
                          }
                        },
                        "right": {
                          "type": "Literal",
                          "value": "</div>",
                          "raw": "'</div>'"
                        }
                      },
                      "right": {
                        "type": "Literal",
                        "value": "</div>",
                        "raw": "'</div>'"
                      }
                    },
                    "kind": "init",
                    "method": false,
                    "shorthand": false
                  }
                ]
              }
            ]
          }
        }
      ],
      "kind": "var"
    },
    {
      "type": "EmptyStatement"
    }
  ]
}

Each node has a type and some data, depending on what type it is. A Program node has a body list of child nodes, an ArrayExpression has elements, an ObjectExpression has properties, Literals have a value, and so on.
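Because the tree is just dicts and lists, individual nodes can be reached with ordinary indexing. The sketch below uses a hand-written, heavily trimmed tree in the same shape as the dump above (one property only, bookkeeping keys such as "raw" and "computed" left out), so it runs without the parser.

```python
# Hand-written miniature of the syntax tree above (trimmed for illustration)
program = {
    "type": "Program",
    "body": [
        {
            "type": "VariableDeclaration",
            "kind": "var",
            "declarations": [
                {
                    "type": "VariableDeclarator",
                    "id": {"type": "Identifier", "name": "markersData"},
                    "init": {
                        "type": "ArrayExpression",
                        "elements": [
                            {
                                "type": "ObjectExpression",
                                "properties": [
                                    {
                                        "type": "Property",
                                        "key": {"type": "Literal", "value": "city"},
                                        "value": {"type": "Literal", "value": "Shelton"},
                                    }
                                ],
                            }
                        ],
                    },
                }
            ],
        }
    ],
}

# Program -> body -> VariableDeclaration -> declarations -> VariableDeclarator
declarator = program["body"][0]["declarations"][0]

# VariableDeclarator -> init (ArrayExpression) -> elements -> ObjectExpression
first_object = declarator["init"]["elements"][0]

# ObjectExpression -> properties -> Property with a key node and a value node
first_property = first_object["properties"][0]

print(declarator["id"]["name"])
print(first_property["key"]["value"], "=", first_property["value"]["value"])
```

Hard-coded paths like these break as soon as the script changes shape, which is why the rest of this answer walks the tree instead.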

The complex bit here is the HTML string concatenation at the end - scroll all the way down - it consists of nested BinaryExpression nodes where each one has an operator (+) with a left side and a right side.
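To make the nesting concrete: a chain like 'a' + 'b' + 'c' is grouped as ('a' + 'b') + 'c', so the first string ends up deepest on the left and the last string is the right side of the outermost node. The following sketch builds such a chain by hand (lit and plus are hypothetical helpers, not part of pyjsparser) and collects the pieces in source order.

```python
def lit(value):
    """Builds a hand-written Literal node (hypothetical helper)."""
    return {"type": "Literal", "value": value}


def plus(left, right):
    """Builds a hand-written BinaryExpression node for `left + right` (hypothetical helper)."""
    return {"type": "BinaryExpression", "operator": "+", "left": left, "right": right}


# '<div>' + 'text' + '</div>' groups as ('<div>' + 'text') + '</div>'
tree = plus(plus(lit("<div>"), lit("text")), lit("</div>"))


def flatten(node):
    """Returns the literal operands of a '+' chain in source order."""
    if node["type"] == "BinaryExpression" and node["operator"] == "+":
        return flatten(node["left"]) + flatten(node["right"])
    return [node["value"]]


parts = flatten(tree)
print(parts)
print("".join(parts))
```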

To convert all of this into something that is usable in Python, we must do what's called "tree walking" (or "tree traversal"). We visit every node of the tree, decide what to do with it, and then move on to its child nodes.

A simple tree walker starts at some node, looks at all the properties of that node, and then descends into every property that is a list. A predicate function is used to select the nodes we are interested in:

def walk(node, predicate):
    ''' traverses the syntax tree and finds matching nodes (recursive) '''

    if not (isinstance(node, dict) and 'type' in node):
        return

    if predicate(node):
        yield node

    for key in node:
        if isinstance(node[key], list):
            for child in node[key]:
                yield from walk(child, predicate)

To find the VariableDeclarator node identified by the name markersData, we would use it like this:

is_markersData = lambda node: node['type'] == 'VariableDeclarator' and node['id']['name'] == 'markersData'
markersData_node = next(walk(program, is_markersData))
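One caveat: walk only follows list-valued properties. That is enough for a top-level declaration like the one on this page, but a child node stored directly as a dict is never visited. The sketch below shows a possible variant, walk_all (a hypothetical name), that also descends into dict-valued properties. The test tree is hand-written and assumes a function body is represented as a single BlockStatement dict, as in the usual esprima-style layout.

```python
def walk(node, predicate):
    ''' the walker from above: only descends into list-valued properties '''
    if not (isinstance(node, dict) and 'type' in node):
        return
    if predicate(node):
        yield node
    for key in node:
        if isinstance(node[key], list):
            for child in node[key]:
                yield from walk(child, predicate)


def walk_all(node, predicate):
    ''' variant that descends into dict-valued and list-valued properties '''
    if isinstance(node, list):
        for child in node:
            yield from walk_all(child, predicate)
        return
    if not (isinstance(node, dict) and 'type' in node):
        return
    if predicate(node):
        yield node
    for value in node.values():
        if isinstance(value, (dict, list)):
            yield from walk_all(value, predicate)


# Hand-written tree for: function init() { var markersData = []; }
program = {
    "type": "Program",
    "body": [{
        "type": "FunctionDeclaration",
        "id": {"type": "Identifier", "name": "init"},
        "params": [],
        "body": {
            "type": "BlockStatement",
            "body": [{
                "type": "VariableDeclaration",
                "kind": "var",
                "declarations": [{
                    "type": "VariableDeclarator",
                    "id": {"type": "Identifier", "name": "markersData"},
                    "init": {"type": "ArrayExpression", "elements": []},
                }],
            }],
        },
    }],
}


def is_markers_data(node):
    return node['type'] == 'VariableDeclarator' and node['id'].get('name') == 'markersData'


found_by_walk = list(walk(program, is_markers_data))
found_by_walk_all = list(walk_all(program, is_markers_data))
print(len(found_by_walk), len(found_by_walk_all))
```

For the target page the simpler walk is sufficient; the variant only matters if the declaration is wrapped in a function or similar construct.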

Now that we have the right node, we must evaluate it. The following function can walk e.g. our markersData_node and calculate a result value. Compare what the function would do for each type of node in the syntax tree above:

def evaluate(node):
    ''' converts a JS literal to a Python data structure (recursive) '''

    # JS primitives are returned as their values
    if node['type'] == 'Literal':
        return node['value']

    # JS object literals consist of multiple key-value properties
    if node['type'] == 'ObjectExpression':
        return { evaluate(p['key']): evaluate(p['value']) for p in node['properties'] }

    # JS array literals are lists of elements that we evaluate individually
    if node['type'] == 'ArrayExpression':
        return [ evaluate(item) for item in node['elements'] ]

    # expressions (such as `'string1' + 'string2'`) we calculate
    if node['type'] == 'BinaryExpression':
        op = node['operator']
        left = evaluate(node['left'])
        right = evaluate(node['right'])
        if op == '+':
            return left + right

        raise Exception("Unsupported operator %s on %s." % (op, node))

    # for variables we are interested in their initializer (the part after the `=`)
    if node['type'] == 'VariableDeclarator':
        return evaluate(node['init'])

    # everything else causes an error, we can implement it when we see it
    raise Exception("Don't know what to do with a %s." % node['type'])

It understands just enough of JavaScript to figure out our example. Support for other things (such as dates, regexes, operators other than +) can be added when they occur.
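As an idea of what such an extension could look like, here is a small standalone sketch that handles a few more operators through lookup tables. It assumes negative numbers arrive as UnaryExpression nodes with an operator and an argument, which is the usual esprima-style shape but is not shown in the dump above, so verify it against real parser output. The nodes are hand-written, and evaluate_expr and the two tables are illustrative names.

```python
import operator

# Operators this sketch knows about; extend the tables as new ones show up.
# Note: Python does not coerce types the way JS does ('a' + 1 raises TypeError here).
BINARY_OPS = {'+': operator.add, '-': operator.sub, '*': operator.mul}
UNARY_OPS = {'-': operator.neg}


def evaluate_expr(node):
    ''' evaluates literals plus a few unary and binary operators (recursive) '''
    if node['type'] == 'Literal':
        return node['value']

    if node['type'] == 'UnaryExpression':
        op = node['operator']
        if op not in UNARY_OPS:
            raise ValueError("Unsupported unary operator %s." % op)
        return UNARY_OPS[op](evaluate_expr(node['argument']))

    if node['type'] == 'BinaryExpression':
        op = node['operator']
        if op not in BINARY_OPS:
            raise ValueError("Unsupported binary operator %s." % op)
        return BINARY_OPS[op](evaluate_expr(node['left']), evaluate_expr(node['right']))

    raise ValueError("Don't know what to do with a %s." % node['type'])


# Hand-written node for the JS expression: -73.106124
negative_lng = {
    "type": "UnaryExpression",
    "operator": "-",
    "prefix": True,
    "argument": {"type": "Literal", "value": 73.106124},
}

# Hand-written node for the JS expression: 2 * 3
product = {
    "type": "BinaryExpression",
    "operator": "*",
    "left": {"type": "Literal", "value": 2},
    "right": {"type": "Literal", "value": 3},
}

print(evaluate_expr(negative_lng))
print(evaluate_expr(product))
```

The same two branches could be folded into the evaluate function above in place of its single + case.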

When we call it:

markersData = evaluate(markersData_node)
print(json.dumps(markersData[0:1], indent="    "))

we get:

[
    {
        "id": "0",
        "title": "Al Preston&#39;s Garage, Inc.",
        "lat": "41.324549",
        "lng": "-73.106124",
        "addship": "810 Howe Avenue",
        "city": "Shelton",
        "state": "CT",
        "zip": "06484",
        "phone": "(203) 924-1747",
        "fax": "(203) 924-4594",
        "machinetype": "Sales, Service & Parts for:<div data-inline=\"true\" class=\"dealerExcavator\" title=\"Available at this dealer\"></div><div data-inline=\"true\" class=\"dealerTrackLoader\" title=\"Available at this dealer\"></div><div data-inline=\"true\" class=\"dealerWheelLoader\" title=\"Available at this dealer\"></div>",
        "website": "http://takeuchi-us.alprestonsequipment.com/home_takeuchi-us.asp",
        "description": "<div class=\"popContainer\"><div class=\"popTitle\">Al Preston&#39;s Garage, Inc.</div><div class=\"popBody\"><span class=\"popBodyText\">810 Howe Avenue</span><br /><span class=\"popBodyText\">Shelton, CT 06484</span><br /><span class=\"popBodyText\">Phone: (203) 924-1747</span><br /><span class=\"popBodyText\">Fax: (203) 924-4594</span><br /><span class=\"popBodyText\">Sales, Service & Parts for:<div data-inline=\"true\" class=\"dealerExcavator\" title=\"Available at this dealer\"></div><div data-inline=\"true\" class=\"dealerTrackLoader\" title=\"Available at this dealer\"></div><div data-inline=\"true\" class=\"dealerWheelLoader\" title=\"Available at this dealer\"></div></span><a href=\"http://takeuchi-us.alprestonsequipment.com/home_takeuchi-us.asp\" target=\"_blank\" class=\"popButton\">Dealer Website</a><a href=\"javascript:void(0)\" onClick=\"directions('41.324549', '-73.106124');\" data-lat=\"41.324549\" data-lng=\"-73.106124\" class=\"popButton cs_ml_5\">Dealer Directions</a></div></div>"
    }
]

which matches exactly what the JS code would produce in the browser - but here it's an actual Python data structure that we can use. I've tested it against the whole JS source from your target page and it works for that.
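From here, getting to the fields asked for in the question is plain Python. A minimal sketch, using the sample record from above (trimmed to its plain fields) in place of the full list: it keeps only the wanted keys, decodes HTML entities such as &#39; with html.unescape, and writes CSV into an in-memory buffer. The FIELDS list and the choice of CSV as output are assumptions for illustration.

```python
import csv
import html
import io

# Stand-in for the evaluated result; the real list has one dict per dealer
markersData = [
    {
        "id": "0",
        "title": "Al Preston&#39;s Garage, Inc.",
        "lat": "41.324549",
        "lng": "-73.106124",
        "addship": "810 Howe Avenue",
        "city": "Shelton",
        "state": "CT",
        "zip": "06484",
        "phone": "(203) 924-1747",
        "fax": "(203) 924-4594",
    }
]

# The fields listed in the question
FIELDS = ["id", "title", "lat", "lng", "addship", "city", "state", "zip", "phone"]

# Keep only the wanted fields and turn entities like &#39; back into characters.
# Values stay strings so that ZIP codes keep their leading zero.
rows = [
    {field: html.unescape(marker[field]) for field in FIELDS}
    for marker in markersData
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

To write a file instead, the buffer can be swapped for open("dealers.csv", "w", newline=""), where the file name is just an example.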

(All of this would have been much easier had the input actually been JSON to begin with.)

Upvotes: 2
