Jake

Reputation: 2923

Unable to locate html tag for scraping

I'm not great with HTML, so I'm a bit stumped by this.

I'm trying to scrape the datetime of Instagram posts using Python, and realised that the datetime information isn't within the HTML document of the post. However, I am able to find it using inspect element. See the screenshot below.

[Screenshot: inspect element view of the date (below the Follow button)]

Where is this datetime information located exactly, and how can I obtain it?

The example is taken from this random post: "https://www.instagram.com/p/BEtMWWbjoPh/". The element is the "12h" displayed on the page.

[Update] I am using urllib to fetch the URL, and bs4 in Python to scrape it. The output did not contain anything with a datetime. The code is below. I also printed out the entire HTML and was surprised that it does not contain the datetime at all.

import urllib
from bs4 import BeautifulSoup

html = urllib.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
tags = soup.select('time')
for tag in tags:
    # tag.get() already returns the attribute value as a string
    dateT = tag.get('datetime')
    print(dateT)

Upvotes: 1

Views: 148

Answers (2)

BananaNeil

Reputation: 10762

I think the problem you are experiencing is that urllib.urlopen(url).read() does not execute any JavaScript on the page.

Because Instagram is a client-side JavaScript app that relies on your browser to render the site, you'll need some sort of browser client to evaluate the JavaScript before you can find the element on the page. For this, I usually use PhantomJS (with the Ruby driver Capybara, but I would assume there is a Python package that works similarly).

HOWEVER, if you execute urllib.urlopen(url).read(), you should see a block of JSON in a script tag that begins with <script type="text/javascript">window._sharedData = {...

That block of JSON will include the data you are looking for. If you extract that JSON and parse it, you should be able to access the time data you need.
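As a rough illustration of that approach, here is a minimal sketch using only the standard library. It assumes the script tag has the form `window._sharedData = {...};</script>`; the helper names `extract_shared_data` and `find_keys` are made up for this example, and since the key names inside the JSON aren't given here, the sketch just searches for keys by substring rather than assuming a particular path.

```python
import json
import re

# Assumption: the page embeds its data as `window._sharedData = {...};`
# immediately before the closing </script> tag.
SHARED_DATA_RE = re.compile(
    r"window\._sharedData\s*=\s*(\{.*?\})\s*;\s*</script>", re.DOTALL
)


def extract_shared_data(html):
    """Return the parsed _sharedData object, or None if it is not present."""
    match = SHARED_DATA_RE.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))


def find_keys(node, wanted):
    """Recursively yield (key, value) pairs whose key contains `wanted`."""
    if isinstance(node, dict):
        for key, value in node.items():
            if wanted in key.lower():
                yield key, value
            yield from find_keys(value, wanted)
    elif isinstance(node, list):
        for item in node:
            yield from find_keys(item, wanted)
```

You could then call `list(find_keys(extract_shared_data(html), "date"))` to see which date-like fields the page actually contains, and pick the one you need from there.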

That being said, the better way to do this is to use Instagram's API for the crawling. They make all of this data available to developers, so you don't have to crawl an ever-changing webpage.

(Apparently Instagram's API will only return public data for users who have explicitly given your app permission)

Upvotes: 1

user6105387

Reputation:

In your developer console, type this:

document.getElementsByTagName('time')[0].getAttribute('datetime');

This will return the data you are looking for. The above code simply looks through the HTML for the tag name time, of which there is only one, then grabs the datetime attribute from it.

As for python, check out BeautifulSoup if you haven't already. This library will allow you to do a similar thing in python:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
soup.time['datetime']

Where html_doc is your raw HTML. To obtain the raw HTML, use the requests library.
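One caveat with the snippet above: if the raw HTML contains no time tag (as the question reports for HTML fetched without running JavaScript), `soup.time` is `None` and the subscript fails. A slightly more defensive sketch, where `first_datetime` is just an illustrative helper name, might look like this:

```python
from bs4 import BeautifulSoup


def first_datetime(html_doc):
    """Return the datetime attribute of the first <time> tag, or None."""
    soup = BeautifulSoup(html_doc, "html.parser")
    tag = soup.find("time")
    if tag is None:
        # No <time> element in this HTML at all.
        return None
    # .get() returns None instead of raising if the attribute is missing.
    return tag.get("datetime")
```

A `None` result then tells you the element isn't in the HTML you downloaded, rather than crashing the scraper.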

Upvotes: 3
