Reputation: 229
I am trying to parse the table from this website. I started with just the Username
column and, with the help I got on Stack Overflow, I was able to get the contents of that
column with the following code:
from bs4 import BeautifulSoup

with open("Top 50 TikTok users sorted by Followers - Socialblade TikTok Stats _ TikTok Statistics.html", "r", encoding="utf-8") as file:
    soup = BeautifulSoup(str(file.readlines()), "html.parser")

tiktok = []
for tag in soup.select("div div:nth-of-type(n+5) > div > a"):
    tiktok.append(tag.text)
which gives me
['addison rae',
'Bella Poarch',
'Zach King',
'TikTok',
'Spencer X',
'Will Smith',
'Loren Gray',
'dixie',
'Michael Le',
'Jason Derulo',
'Riyaz',
.
.
.
My ultimate goal is to populate the entire table with the columns [Rank, Grade, Username, Uploads, Followers, Following, Likes].
I have read a few articles on Parsing HTML Tables in Python with BeautifulSoup and pandas,
but those approaches didn’t work because the data is not defined as a table in the page source. What are some alternatives for getting it into a table in Python?
Upvotes: 1
Views: 1609
Reputation: 195418
You can use this code to load the HTML from a file into soup and then parse the table into a DataFrame:
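import pandas as pd
from bs4 import BeautifulSoup

# Read the saved page; the context manager closes the file afterwards.
with open("page.html", "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

data = []
# Each row is a <div> whose inline style contains the colour fafafa or f8f8f8.
for div in soup.select('div[style*="fafafa"], div[style*="f8f8f8"]'):
    # Keep the text of the first eight direct child <div>s, one per column.
    data.append(
        [
            d.get_text(strip=True)
            for d in div.find_all("div", recursive=False)[:8]
        ]
    )

df = pd.DataFrame(
    data,
    columns=[
        "Rank",
        "Grade",
        "Username",
        "Uploads",
        "Followers",
        "Following",
        "Likes",
        "Interactions",
    ],
)
print(df)
df.to_csv("data.csv", index=False)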
Prints:
Rank Grade Username Uploads Followers Following Likes Interactions
0 1st A++ charli d’amelio 1,755 113,600,000 1,210 9,200,000,000 --
1 2nd A++ addison rae 1,411 79,900,000 2,454 5,100,000,000 --
2 3rd A++ Bella Poarch 282 63,600,000 588 1,400,000,000 --
3 4th A++ Zach King 277 58,800,000 41 723,400,000 --
4 5th A++ TikTok 139 52,900,000 495 250,300,000 91
5 6th A++ Spencer X 1,250 52,700,000 7,206 1,300,000,000 --
6 7th A++ Will Smith 73 52,500,000 23 314,400,000 --
7 8th A++ Loren Gray 2,805 52,100,000 221 2,800,000,000 --
8 9th A++ dixie 120 51,200,000 1,267 2,900,000,000 --
9 10th A++ Michael Le 1,158 47,400,000 93 1,300,000,000 --
10 11th A+ Jason Derulo 675 44,900,000 12 1,000,000,000 --
11 12th A+ Riyaz 2,056 44,100,000 43 2,100,000,000 --
12 13th A+ Kimberly Loaiza ✨ 1,150 41,000,000 123 2,200,000,000 --
13 14th A+ Brent Rivera 955 37,800,000 272 1,200,000,000 --
14 15th A+ cznburak 1,301 37,300,000 1 688,700,000 --
15 16th A+ The Rock 42 36,200,000 1 200,300,000 --
16 17th A+ James Charles 238 36,200,000 148 881,400,000 --
17 18th A+ BabyAriel 2,365 35,300,000 326 1,900,000,000 --
18 19th A+ JoJo Siwa 1,206 33,500,000 346 1,100,000,000 --
19 20th A+ avani 5,347 33,300,000 5,003 2,400,000,000 --
20 21st A+ GIL CROES 693 32,900,000 454 803,200,000 --
21 22nd A+ Faisal shaikh 461 32,200,000 -- 2,000,000,000 --
22 23rd A+ BTS 39 32,000,000 -- 557,100,000 255
23 24th A+ LILHUDDY 4,187 30,500,000 8,652 1,600,000,000 --
24 25th A+ Stokes Twins 548 30,100,000 21 781,000,000 --
25 26th A+ Joe 1,487 29,800,000 8,402 1,200,000,000 --
26 27th A+ ROD🥴 1,792 29,500,000 536 1,700,000,000 --
27 28th A+ 𝙳𝚘𝚖𝚒𝚗𝚒𝚔 899 29,400,000 216 1,700,000,000 --
28 29th A+ Kylie Jenner 69 29,400,000 14 318,800,000 --
29 30th A+ Junya/じゅんや 2,823 29,000,000 1,934 533,800,000 12,200
30 31st A+ YZ 816 28,900,000 563 554,700,000 --
31 32nd A+ Arishfa Khan🦁 2,026 28,600,000 27 1,100,000,000 --
32 33rd A+ Lucas and Marcus 1,248 28,500,000 158 806,500,000 --
33 34th A+ jannat_zubair29 1,054 28,200,000 6 746,300,000 47
34 35th A+ Nisha Guragain 1,751 28,000,000 33 756,300,000 --
35 36th A+ Selena Gomez 40 27,800,000 17 82,300,000 --
36 37th A+ Kris HC 1,049 27,800,000 1,405 1,200,000,000 --
37 38th A+ flighthouse 4,200 27,600,000 488 2,300,000,000 --
38 39th A+ wigofellas 1,251 27,500,000 812 707,200,000 --
39 40th A+ Savannah LaBrant 1,860 27,300,000 155 1,400,000,000 --
40 41st A+ noah beck 1,395 26,900,000 2,297 1,700,000,000 --
41 42nd A+ Liza Koshy 155 26,700,000 104 321,900,000 --
42 43rd A+ Kirya Kolesnikov 1,338 26,400,000 78 543,200,000 --
43 44th A+ Awez Darbar 2,708 26,100,000 208 1,100,000,000 --
44 45th A+ Carlos Feria 2,522 25,700,000 138 1,200,000,000 --
45 46th A+ Kira Kosarin 837 25,700,000 401 447,000,000 --
46 47th A+ Naim Darrechi🏆 2,634 25,300,000 527 2,200,000,000 --
47 48th A+ Josh Richards 1,899 24,900,000 9,847 1,600,000,000 --
48 49th A+ Q Park 231 24,800,000 3 294,100,000 --
49 50th A+ TikTok_India 186 24,500,000 191 40,100,000 --
It also saves the result to data.csv.
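Note that every cell in the printed frame is still a string: the counts carry thousands separators and missing values show up as "--". If you want to sort or do arithmetic on those columns, something along these lines might help. It is only a sketch: the two sample rows are copied from the output above, and the choice of the nullable "Int64" dtype and the `count_columns` name are my own assumptions, not part of the original code.

```python
import pandas as pd

# Two sample rows copied from the printed output above; every cell is a string.
df = pd.DataFrame(
    {
        "Username": ["TikTok", "Faisal shaikh"],
        "Uploads": ["139", "461"],
        "Followers": ["52,900,000", "32,200,000"],
        "Following": ["495", "--"],
        "Likes": ["250,300,000", "2,000,000,000"],
        "Interactions": ["91", "--"],
    }
)

# Assumption: these are the columns you want as numbers.
count_columns = ["Uploads", "Followers", "Following", "Likes", "Interactions"]

for column in count_columns:
    # Remove the thousands separators (plain string replace, not a regex).
    cleaned = df[column].str.replace(",", "", regex=False)
    # Turn anything non-numeric (here the "--" placeholder) into a missing value,
    # then store the column as nullable integers.
    df[column] = pd.to_numeric(cleaned, errors="coerce").astype("Int64")

print(df.dtypes)
print(df)
```

With the full table you would run the same loop on the `df` built from the page instead of the sample frame.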
EDIT: To also get the username from each profile URL:
import pandas as pd
from bs4 import BeautifulSoup

# Read the saved page; the context manager closes the file afterwards.
with open("page.html", "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

data = []
for div in soup.select('div[style*="fafafa"], div[style*="f8f8f8"]'):
    data.append(
        [
            d.get_text(strip=True)
            for d in div.find_all("div", recursive=False)[:8]
        ]
        # The last path segment of the row's first link is the URL username.
        + [div.a["href"].split("/")[-1]]
    )

df = pd.DataFrame(
    data,
    columns=[
        "Rank",
        "Grade",
        "Username",
        "Uploads",
        "Followers",
        "Following",
        "Likes",
        "Interactions",
        "URL username",
    ],
)
print(df)
df.to_csv("data.csv", index=False)
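If you want to see how the selector and the `href` split behave without the saved page, here is a small self-contained sketch. The markup is hypothetical: I am assuming one outer `<div>` per row with the colour in its inline style and one child `<div>` per cell, and the `example.com` URLs and handles are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the assumed row layout (not the real page).
html = """
<div style="background: #fafafa;">
  <div>1st</div><div>A++</div>
  <div><a href="https://example.com/tiktok/user/example_handle_1">First User</a></div>
</div>
<div style="background: #f8f8f8;">
  <div>2nd</div><div>A++</div>
  <div><a href="https://example.com/tiktok/user/example_handle_2">Second User</a></div>
</div>
<div style="background: #ffffff;"><div>not a table row</div></div>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
# [style*="..."] matches any element whose style attribute contains the substring,
# so only the two coloured row <div>s are selected.
for div in soup.select('div[style*="fafafa"], div[style*="f8f8f8"]'):
    # recursive=False keeps only the direct children, i.e. one <div> per cell.
    cells = [d.get_text(strip=True) for d in div.find_all("div", recursive=False)]
    # div.a is the first <a> inside the row; the last path segment is the handle.
    handle = div.a["href"].split("/")[-1]
    rows.append(cells + [handle])

for row in rows:
    print(row)
```

One thing to watch for: if an `href` ends with a trailing slash, `split("/")[-1]` returns an empty string, so you may want `div.a["href"].rstrip("/").split("/")[-1]` in that case.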
Upvotes: 1