Difficulty crawling anchor tags from HTML using BeautifulSoup in Python 3

Question

I am trying to extract the hrefs from an institution's web page. I need the department codes for further crawling, and I have written the following code:

import requests
import re
import urllib
from bs4 import BeautifulSoup

codesurl="http://www.iitkgp.ac.in/academics/?page=acadunits"
response = requests.get(codesurl)
# print(response.content)
soup=BeautifulSoup(response.content)
# print(soup.prettify())
p = re.compile('page=acadunits*')
p1 = re.compile('



But I am not getting all of the hrefs, for example:

Mechanical Engineering
Medical Science & Technology
Metallurgical & Materials Engineering


and many more.
Can somebody help me with this? This is the first time I am crawling a site.
You can also look at the website. I need to extract the dept code from each URL:

dept=ME
dept=MT
dept=MD


My web page contains:

        Aerospace Engineering

        Agricultural & Food Engineering

        Architecture & Regional Planning

        Biotechnology

        Chemical Engineering

        Chemistry

        Civil Engineering

        Computer Science & Engineering

        Cryogenic Engineering

        Center for Educational Technology

        Electrical Engineering

         Electronics & Electrical Communication Engineering

        G S Sanyal School of Telecommunications

        Geology & Geophysics

        Humanities & Social Sciences

        Industrial & Systems Engineering

        Information Technology

        Materials Science

        Mathematics

        Mechanical Engineering

        Medical Science & Technology

        Metallurgical & Materials Engineering

        Mining Engineering

        Ocean Engineering & Naval Architecture

        Oceans, Rivers, Atmosphere and Land Sciences

        Physics

        P K Sinha Centre for Bio Energy

        Rajendra Mishra School of Engineering Entrepreneurship

        Rajiv Gandhi School of Intellectual Property Law

        Ranbir and Chitra Gupta School of Infrastructure Design and Management

        Reliability Engineering Centre

        Rubber Technology Centre

        Rural Development Centre

        School of Bioscience

        School of Energy Science & Engineering

        School of Environmental Science and Technology

        School of Nano-Science and Technology

        School of Water Resources

        Vinod Gupta School of Management

But when I do:

codesurl="http://www.iitkgp.ac.in/academics/?page=acadunits"
response = requests.get(codesurl)
soup=BeautifulSoup(response.text)


soup does not contain these hrefs.
Can someone suggest how to extract them?

alecxe · Accepted Answer

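First of all, the department links are loaded dynamically with a separate GET request to a different URL (http://www.iitkgp.ac.in/academics/academic.php, used in the code below), so they are not in the HTML returned for the original page.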
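One way to check for this situation is to parse whatever HTML a request returned and count the anchors whose href contains a dept= parameter. The sketch below is only an illustration: count_dept_links is a made-up helper name, and both HTML strings are hypothetical stand-ins rather than markup copied from the site. It uses the built-in "html.parser" so it runs without lxml installed.

```python
import re

from bs4 import BeautifulSoup

# Same pattern idea as in the answer: "dept=" followed by capital letters
DEPT_PATTERN = re.compile(r"dept=([A-Z]+)")


def count_dept_links(html):
    """Return how many <a> tags in html have an href containing dept=CODE."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all("a", href=DEPT_PATTERN))


# Hypothetical stand-in for the page shell, before any links are loaded
shell_html = '<div id="content"></div>'
# Hypothetical stand-in for the fragment that is fetched separately
fragment_html = '<a href="?page=acadunits&amp;dept=ME">Mechanical Engineering</a>'

print(count_dept_links(shell_html))
print(count_dept_links(fragment_html))
```

If the count is zero for the HTML you actually fetched, the links are probably coming from another request, which you can usually spot in the browser's network tab.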

Then, the idea is to find all links whose href attribute matches a specific pattern and use the same pattern to extract the department codes. Working code:

import re

import requests
from bs4 import BeautifulSoup

codesurl = "http://www.iitkgp.ac.in/academics/academic.php"
response = requests.get(codesurl)
soup = BeautifulSoup(response.content, "lxml")

pattern = re.compile(r"dept=([A-Z]+)")
links = soup.find_all("a", href=pattern)

for link in links:
    print(pattern.search(link["href"]).group(1))

Prints:

AE
AG
AR
...
NT
WM
SM
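As an alternative to the regular expression, the dept value can also be read by parsing each href as a URL query string. This is a minimal sketch, not part of the original answer: extract_dept_codes is a hypothetical helper, and sample_html is invented markup that only assumes the links carry a dept query parameter, as in the dept=ME example from the question.

```python
from urllib.parse import parse_qs, urlparse

from bs4 import BeautifulSoup


def extract_dept_codes(html):
    """Collect the values of the dept query parameter from every link."""
    soup = BeautifulSoup(html, "html.parser")
    codes = []
    for link in soup.find_all("a", href=True):
        # parse_qs maps each parameter name to a list of its values
        query = parse_qs(urlparse(link["href"]).query)
        codes.extend(query.get("dept", []))
    return codes


# Hypothetical sample markup; the real page may be structured differently
sample_html = """
<a href="?page=acadunits&amp;dept=ME">First department</a>
<a href="?page=acadunits&amp;dept=MT">Second department</a>
<a href="/contact">Contact</a>
"""

print(extract_dept_codes(sample_html))
```

Links without a dept parameter, such as the contact link in the sample, are skipped because query.get returns an empty list for them.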
