Reputation: 725
I scraped some table data from a website. The actual table on the website looks like this -
I used beautifulsoup to get this data which in this format -
[<td class="TableHeadingLeft" width="175">
Team
</td>,
<td class="TableHeadingRight" width="35">
Mat
</td>,
<td class="TableHeadingRight" width="35">
Won
</td>,
<td class="TableHeadingRight" width="35">
Lost
</td>,
<td class="TableHeadingRight" width="35">
Tied
</td>,
<td class="TableHeadingRight" width="35">
N/R
</td>,
<td class="TableHeadingRight" width="45">
Points
</td>,
<td class="TableHeadingRight" width="55">
Net R/R
</td>,
<td class="TableHeadingRight" width="75">
For
</td>,
<td class="TableHeadingRight" width="75">
Against
</td>,
<td align="left">
<a class="LinkOff" href="MatchList.asp?s=2019&t=MIN">Mumbai Indians</a>
</td>,
<td align="right">
14
</td>,
<td align="right">
9
</td>,
<td align="right">
5
</td>,
<td align="right">
0
</td>,
<td align="right">
0
</td>,
<td align="right">
18
</td>,
<td align="right">
+0.421
</td>,
<td align="right">
2380/275.1
</td>,
<td align="right">
2282/277.2
</td>,
<td align="left">
<a class="LinkOff" href="MatchList.asp?s=2019&t=CSK">Chennai Super Kings</a>
</td>,
<td align="right">
14
</td>,
<td align="right">
9
</td>,
<td align="right">
5
</td>,
<td align="right">
0
</td>,
<td align="right">
0
</td>,
<td align="right">
18
</td>,
<td align="right">
+0.131
</td>,
<td align="right">
2043/274.1
</td>,
<td align="right">
2012/274.5
</td>,
<td align="left">
<a class="LinkOff" href="MatchList.asp?s=2019&t=DDV">Delhi Capitals</a>
</td>,
<td align="right">
14
</td>,
<td align="right">
9
</td>,
<td align="right">
5
</td>,
<td align="right">
0
</td>,
<td align="right">
0
</td>,
<td align="right">
18
</td>,
<td align="right">
+0.044
</td>,
<td align="right">
2207/272.5
</td>,
<td align="right">
2238/278.1
</td>,
<td align="left">
<a class="LinkOff" href="MatchList.asp?s=2019&t=SUN">Sunrisers Hyderabad</a>
</td>,
<td align="right">
14
</td>,
<td align="right">
6
</td>,
<td align="right">
8
</td>,
<td align="right">
0
</td>,
<td align="right">
0
</td>,
<td align="right">
12
</td>,
<td align="right">
+0.577
</td>,
<td align="right">
2288/269.2
</td>,
<td align="right">
2200/277.5
</td>,
<td align="left">
<a class="LinkOff" href="MatchList.asp?s=2019&t=KKR">Kolkata Knight Riders</a>
</td>,
<td align="right">
14
</td>,
<td align="right">
6
</td>,
<td align="right">
8
</td>,
<td align="right">
0
</td>,
<td align="right">
0
</td>,
<td align="right">
12
</td>,
<td align="right">
+0.028
</td>,
<td align="right">
2466/270.4
</td>,
<td align="right">
2419/266.2
</td>,
<td align="left">
<a class="LinkOff" href="MatchList.asp?s=2019&t=KXI">Kings XI Punjab</a>
</td>,
<td align="right">
14
</td>,
<td align="right">
6
</td>,
<td align="right">
8
</td>,
<td align="right">
0
</td>,
<td align="right">
0
</td>,
<td align="right">
12
</td>,
<td align="right">
-0.251
</td>,
<td align="right">
2429/276.3
</td>,
<td align="right">
2503/277.0
</td>,
<td align="left">
<a class="LinkOff" href="MatchList.asp?s=2019&t=RRO">Rajasthan Royals</a>
</td>,
<td align="right">
14
</td>,
<td align="right">
5
</td>,
<td align="right">
8
</td>,
<td align="right">
0
</td>,
<td align="right">
1
</td>,
<td align="right">
11
</td>,
<td align="right">
-0.449
</td>,
<td align="right">
2153/257.0
</td>,
<td align="right">
2192/248.2
</td>,
<td align="left">
<a class="LinkOff" href="MatchList.asp?s=2019&t=RCB">Royal Challengers Bangalore</a>
</td>,
<td align="right">
14
</td>,
<td align="right">
5
</td>,
<td align="right">
8
</td>,
<td align="right">
0
</td>,
<td align="right">
1
</td>,
<td align="right">
11
</td>,
<td align="right">
-0.607
</td>,
<td align="right">
2146/258.4
</td>,
<td align="right">
2266/254.3
</td>]
Now, I used a loop and some code to extract the data that I need.
for data in table_data.find_all('td'):
print(''.join(data.text.split()))
The output of this -
Team
Mat
Won
Lost
Tied
N/R
Points
NetR/R
For
Against
MumbaiIndians
14
9
5
0
0
18
+0.421
2380/275.1
2282/277.2
ChennaiSuperKings
14
9
5
0
0
18
+0.131
2043/274.1
2012/274.5
DelhiCapitals
14
9
5
0
0
18
+0.044
2207/272.5
2238/278.1
SunrisersHyderabad
14
6
8
0
0
12
+0.577
2288/269.2
2200/277.5
KolkataKnightRiders
14
6
8
0
0
12
+0.028
2466/270.4
2419/266.2
KingsXIPunjab
14
6
8
0
0
12
-0.251
2429/276.3
2503/277.0
RajasthanRoyals
14
5
8
0
1
11
-0.449
2153/257.0
2192/248.2
RoyalChallengersBangalore
14
5
8
0
1
11
-0.607
2146/258.4
2266/254.3
Created an empty dataframe of the desired shape -
col_name = ['Team','Mat','Won','Lost','Tied','N/R','Points','NetR/R','For','Against']
import pandas as pd
df = pd.DataFrame(data=None, columns=col_name)
Now, I couldn't able to understand, how to add these data to this dataframe.
Upvotes: 0
Views: 697
Reputation: 69
Once you have the table data as 1D list
table_data_list = [''.join(i.text.split() for i in table_data.find_all('td')]
Convert the data into 2D using numpy reshape
df = pd.DataFrame(np.array(table_data_list).reshape(-1,10))
Upvotes: 0
Reputation: 2615
There is a simple way to create a dataframe from list of list.
Code :
ll = []
for data in table_data.find_all('td'):
ll.append(''.join(data.text.split()))
n = 10
final = [ll[i:i + n] for i in range(0, len(ll), n)]
col = final [0]
df = pd.DataFrame(final)
df.columns = df.iloc[0]
df = df.drop(df.index[0])
df
ll
is list of list from your table data.
final
is split the ll
with 10 columns as you mentioned in columns
The last two lines are to make the first row as column name because we just created a dataframe without giving any column names.
Upvotes: 1
Reputation: 797
Use pd.concat
import pandas as pd
col_name = ['Team','Mat','Won','Lost','Tied','N/R','Points','NetR/R','For','Against']
# Main dataframe
df = pd.DataFrame(data=None, columns=col_name)
# df1 is the data frame that you want to concat in your original data frame
df = pd.concat([df, df1])
Upvotes: 0