Reputation: 189
I am trying to scrape data from https://gmatclub.com/forum/decision-tracker.html. After a lot of trial and error, I still cannot figure out how to get the data out of the table.
import requests
from bs4 import BeautifulSoup
url = "https://gmatclub.com/forum/decision-tracker.html"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
container = soup.find('div', attrs = {'class' : 'mainPage'})
print(container)
Upvotes: 1
Views: 79
Reputation: 20042
If you want to practice, take a look at Developer Tools -> Network -> XHR
and grab the update endpoint:
https://gmatclub.com/api/schools/v1/forum/app-tracker-latest-updates?limit=50&year=all
and use it to get the current data.
Here's how:
import requests

with requests.Session() as connection:
    # Send browser-like headers with every request in this session.
    connection.headers.update(
        {
            "referer": "https://gmatclub.com/forum/decision-tracker.html",
            "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.86 YaBrowser/21.3.0.740 Yowser/2.5 Safari/537.36",
        }
    )
    # Load the page first so the session carries any cookies it sets.
    _ = connection.get("https://gmatclub.com/forum/decision-tracker.html")
    endpoint = connection.get("https://gmatclub.com/api/schools/v1/forum/app-tracker-latest-updates?limit=50&year=all").json()
    # Each item under "statistics" is one row of the table.
    for item in endpoint["statistics"]:
        print(item)
This prints one dictionary per row of the table. You can then access any key in each dictionary.
{'id': '194901', 'user_id': '273781', 'applicant_type': 'regular', 'round_id': '4236', 'status_id': '9', 'school_id': '5', 'school_title': 'Booth', 'program_id': '11', 'program_type': '1', 'date': '2021-05-24 23:56:46', 'seconds_ago': '511', 'country': None, 'state': None, 'gmat_quant': None, 'gmat_verbal': None, 'gmat_total': None, 'gmat_modified': None, 'gre_quant': None, 'gre_verbal': None, 'gre_total': None, 'gre_modified_time': None, 'ea_quant': None, 'ea_verbal': None, 'ea_ir': None, 'ea_total': None, 'ea_modified_time': None, 'cat_india_percentile': None, 'cat_india_total': None, 'cat_india_modified_time': None, 'industry': None, 'we': None, 'gpa': None, 'accepted_via': 'phone', 'scholarship': '1', 'user_colour': '', 'truncate_username': '0', 'user_name': 'binhtbc'}
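As a minimal sketch of accessing keys, assume `row` holds a few of the fields from the dictionary printed above. The `describe` helper is hypothetical and only for illustration. Note that the values arrive as strings or `None`, so a numeric field needs an explicit conversion:

```python
# A trimmed copy of the row printed above; the real response has many more keys.
row = {
    "school_title": "Booth",
    "date": "2021-05-24 23:56:46",
    "accepted_via": "phone",
    "scholarship": "1",
    "gmat_total": None,
    "user_name": "binhtbc",
}


def describe(item):
    """Hypothetical helper: summarise one row, tolerating a missing score."""
    # Values are strings or None, so convert only when a score is present.
    gmat = int(item["gmat_total"]) if item.get("gmat_total") is not None else None
    score = f"GMAT {gmat}" if gmat is not None else "no GMAT reported"
    return f'{item["user_name"]} -> {item["school_title"]} ({score})'


print(describe(row))
```

The same function could be called on every `item` in `endpoint["statistics"]` instead of printing the raw dictionary.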
Or you can just load the response into a pandas DataFrame. For example:
import pandas as pd

df = pd.DataFrame(endpoint["statistics"])
print(df.head(10))
Output:
id user_id applicant_type ... user_colour truncate_username user_name
0 194901 273781 regular ... 0 binhtbc
1 183152 643532 regular ... 0 AG23
2 194061 None regular ... 2a2a2a 0 private
3 192923 1034549 regular ... 0 RicardoLima
4 193383 1034549 regular ... 0 RicardoLima
5 194900 1130431 regular ... None VFA
6 177937 876400 regular ... F87431 0 icanhazmba
7 194899 1128750 regular ... None Amanda29
8 194898 1128002 regular ... None Raydiaz
9 193974 1021516 regular ... 0 Kurathore
And, if you feel like it, save this as a .csv file:
df.to_csv("your_table_data.csv", index=False)
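If pandas is not installed, a similar file could be produced with the standard library's `csv.DictWriter` instead. This is a sketch of that alternative, not the approach used above: `rows_to_csv` is a hypothetical helper, and `sample` is a stand-in built from three columns of the first and third rows shown in the output.

```python
import csv
import io


def rows_to_csv(rows, out):
    """Write a list of dictionaries to a file-like object as CSV."""
    if not rows:
        return
    # Collect keys across all rows, keeping first-seen order,
    # in case some rows lack a key that others have.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)  # None values are written as empty fields


# Stand-in rows; in practice this would be endpoint["statistics"].
sample = [
    {"id": "194901", "user_id": "273781", "user_name": "binhtbc"},
    {"id": "194061", "user_id": None, "user_name": "private"},
]
buffer = io.StringIO()
rows_to_csv(sample, buffer)
print(buffer.getvalue())
```

To write to disk rather than to an in-memory buffer, pass a file opened with `open("your_table_data.csv", "w", newline="")` as `out`.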
Upvotes: 5