Gorkem

Reputation: 35

Scraping the links from a specific url

This is my first question, so please forgive me if I have explained anything poorly.

I am trying to scrape URLs from a specific website in Python and write the links to a CSV file. The problem is that when I parse the page with BeautifulSoup I can't extract the URLs: all I get is <div id="dvScores" style="min-height: 400px;">\n</div>, with nothing under that branch. But when I open the browser console, copy the table that contains the links, and paste it into a text editor, I get 600 pages of HTML. What I want is to write a for loop that prints the links. The structure of the HTML is below:

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
#shadow-root (open)
<head>...</head>
<body>
  <div id="body">
    <div id="wrapper">
      #multiple divs but i don't need them
      <div id="live-master"> #what I need is under this div
        <span id="contextual"> 
          #multiple divs but i don't need them
          <div id="live-score-master"> #what I need is under this div
            <div ng-app="live-menu" id="live-score-rightcoll">
              #multiple divs but i don't need them
              <div id="left-score-lefttemp" style="padding-top: 35px;">
                <div id="dvScores">
                  <table cellspacing=0 ...>
                    <colgroup>...</colgroup>
                    <tbody>
                      <tr class="row line-bg1"> #this changes to bg2 or bg3
                        <td class="row"> 
                          <span class="row">
                          <a href="www.example.com" target="_blank" class="td_row">
                                  #I need to extract this link
                          </a>
                          </span>
                        </td>
                        #Multiple td's
                      </tr>
                      #multiple tr class="row line-bg1" or "row line-bg2"
                      .
                      .
                      .
                    </tbody>
                  </table>
                </div>
              </div>
            </div>
          </div>
        </span>
      </div>
    </div>
  </div>
</body>
</html>

What am I doing wrong? I need Python to do this automatically, rather than pasting the HTML into a text editor and extracting the links with a regex. My Python code is below:

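import requests
from bs4 import BeautifulSoup

# Download the page and parse the returned HTML
r = requests.get("http://example.com/example")
soup = BeautifulSoup(r.content, "html.parser")

# Locate the container span, then look for the table body inside it
containers = soup.find_all("span", id="contextual")
tbodies = containers[0].find_all("tbody")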

Upvotes: 2

Views: 4154

Answers (3)

t.m.adam

Reputation: 15376

If you are trying to scrape URLs, then you should get the href attributes:

urls = soup.find_all('a', href=True)
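Note that find_all returns tag objects, so the URL itself still has to be read from each tag's href attribute. A minimal sketch of that step follows, assuming the table is actually present in the HTML being parsed. SAMPLE_HTML and extract_links are hypothetical names, and the snippet is a made-up static fragment modelled on the structure in the question, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical static fragment modelled on the table rows in the question.
SAMPLE_HTML = """
<div id="dvScores">
  <table>
    <tbody>
      <tr class="row line-bg1">
        <td class="row"><span class="row">
          <a href="http://www.example.com/match/1" target="_blank" class="td_row">Match 1</a>
        </span></td>
      </tr>
      <tr class="row line-bg2">
        <td class="row"><span class="row">
          <a href="http://www.example.com/match/2" target="_blank" class="td_row">Match 2</a>
          <a class="td_row">anchor without an href</a>
        </span></td>
      </tr>
    </tbody>
  </table>
</div>
"""


def extract_links(html):
    """Return the href of every <a> inside the div with id="dvScores"."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id="dvScores")
    if container is None:
        return []
    # href=True keeps only anchors that actually carry an href attribute.
    return [a["href"] for a in container.find_all("a", href=True)]


links = extract_links(SAMPLE_HTML)
for link in links:
    print(link)
```

If the div is empty, as in the response described in the question, the same function simply returns an empty list.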

Upvotes: 1

vold

Reputation: 1549

This site uses JavaScript to populate its content, so you can't get the URLs with BeautifulSoup alone. If you inspect the Network tab in your browser, you can spot the request to the link used in the code below. It returns all the data you need, so you can simply parse it and extract the values you want.

import requests

req = requests.get('http://goapi.mackolik.com/livedata?group=0').json()
for el in req['m'][4:100]:
    index = el[0]
    team_1 = el[2].replace(' ', '-')
    team_2 = el[4].replace(' ', '-')
    print('http://www.mackolik.com/Mac/{}/{}-{}'.format(index, team_1, team_2))
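Since the question also asks for the links in a CSV, here is a rough sketch of that last step. It assumes the JSON has the shape used above (a list under the "m" key, with the match id at position 0 and the team names at positions 2 and 4). The sample_payload values are invented for illustration, build_match_urls and write_urls_csv are hypothetical helper names, and the [4:100] slice is left out.

```python
import csv
import io


def build_match_urls(payload):
    """Build one match URL per entry under the "m" key of the payload."""
    urls = []
    for el in payload.get("m", []):
        index = el[0]
        team_1 = el[2].replace(" ", "-")
        team_2 = el[4].replace(" ", "-")
        urls.append(
            "http://www.mackolik.com/Mac/{}/{}-{}".format(index, team_1, team_2)
        )
    return urls


def write_urls_csv(urls, fileobj):
    """Write the URLs to an open text file object, one per row, with a header."""
    writer = csv.writer(fileobj)
    writer.writerow(["url"])
    for url in urls:
        writer.writerow([url])


# Invented sample data in the assumed shape; the real response may differ.
sample_payload = {
    "m": [
        [101, 0, "Team A", 0, "Team B"],
        [102, 0, "Real Sample", 0, "FC Demo"],
    ]
}

urls = build_match_urls(sample_payload)
buffer = io.StringIO()  # stands in for a real file
write_urls_csv(urls, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open("links.csv", "w", newline="") in place of the StringIO buffer.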

Upvotes: 0

Dashadower

Reputation: 672

It seems the HTML is generated dynamically by JavaScript. You would need to fetch it with a crawler that mimics a browser. Since you are already using requests, you can use its session object.

session = requests.session()
data = session.get("http://website.com").content  # usage example

After this you can do the parsing, additional scraping, etc.

Upvotes: 0
