Reputation: 984
I am trying to scrape a table from a web page.
<tr valign="top">
<td class="doprawej bezlewej">
AT00BUWOG001
</td>
<td class="doprawej">
P
</td>
<td class="doprawej">
</td>
<td class="doprawej">
142
</td>
<td class="doprawej">
<b>BUWOG</b>
</td>
<td class="doprawej">
124 184 779
</td>
<td class="doprawej">
16 019,84
</td>
<td class="doprawej">
12 476,29
</td>
<td class="doprawej">
2018-07-31
</td>
<td class="doprawej">
H
</td>
<td class="doprawej">
1,28
</td>
<td class="doprawej">
14,00
</td>
<td class="doprawej bezprawej">
2,30
</td>
</tr>
<tr valign="top">
<td class="doprawej bezlewej">
PLBRSTM00015
</td>
<td class="doprawej">
P
</td>
<td class="doprawej">
LA
</td>
<td class="doprawej">
180
</td>
<td class="doprawej">
<b>CALATRAVA</b>
</td>
<td class="doprawej">
15 000 000
</td>
<td class="doprawej">
3,45
</td>
<td class="doprawej">
7,93
</td>
<td class="doprawej">
2017-03-31
</td>
<td class="doprawej">
H
</td>
<td class="doprawej">
0,44
</td>
<td class="doprawej">
0,00
</td>
<td class="doprawej bezprawej">
0,00
</td>
</tr>
I tried pandas read_clipboard(), but data from one column ends up in different columns, because some cells in the table are empty.
ISIN Market Segment ... PBV PE Div Yield
0 PLNFI0600010 P LA ... 2018-12-31 H 0,14
1 PLNFI0800016 P 141 ... H 0,55 160,00
2 PL11BTS00015 P 650 ... J 9,44 22,60
3 PL4FNMD00013 P 641 ... H 1,25 6,80
4 PLABCDT00014 R 612 ... H 0,94 0,00
5 PLABMSD00015 P 411 ... 0,00 0,00 0,00
6 PLAB00000019 P 612 ... H 0,39 5,10
7 PLACSA000014 P 541 ... J 4,20 13,00
8 PLACTIN00018 P 612 ... H 0,51 0,00
9 PLADVIV00015 P 720 ... H 2,07 0,00
Can I set some arguments in read_clipboard() so that every row of data has the same length as in the HTML, and the data ends up in the right column?
Upvotes: 1
Views: 779
Reputation: 984
In the pandas source, read_clipboard() is just a convenience wrapper around read_csv(), which means you can pass any read_csv() argument in your read_clipboard() call.
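A minimal sketch of what that buys you, assuming the copied text is tab-separated (whether your browser actually puts tabs on the clipboard is an assumption you should check). The sample text is a shortened stand-in for the rows in the question, and it goes through read_csv() with StringIO so it runs without a clipboard. The default separator of read_clipboard() is r"\s+", which swallows empty cells; an explicit tab keeps them.

```python
from io import StringIO

import pandas as pd

# Stand-in for clipboard text (hypothetical sample, shortened from the question).
# Cells are tab-separated; the second row has an empty third cell.
clipboard_text = (
    "PLBRSTM00015\tP\tLA\t180\tCALATRAVA\n"
    "AT00BUWOG001\tP\t\t142\tBUWOG\n"
)

# Whitespace separator (the read_clipboard() default): the empty cell
# disappears and the rest of the second row shifts one column to the left.
shifted = pd.read_csv(StringIO(clipboard_text), sep=r"\s+", header=None)

# Explicit tab separator: the empty cell is kept as NaN and columns line up.
aligned = pd.read_csv(StringIO(clipboard_text), sep="\t", header=None)

print(shifted)
print(aligned)
```

If the clipboard does contain tabs, pd.read_clipboard(sep="\t", header=None) should give the aligned result directly.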
Upvotes: 0
Reputation: 46301
I used the read_html method and wrapped the rows in <table></table> manually.
You can also pretty-print the markup with BeautifulSoup to inspect it:
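from bs4 import BeautifulSoup  # BeautifulSoup 4 (the old BeautifulSoup module is Python 2 only)
html = "..."
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
print(soup.prettify())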
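If you would rather pull the cells out yourself, here is a rough sketch that keeps one entry per <td>, including the empty ones, so every row has the same length as in the HTML. Note that it swaps BeautifulSoup for the standard library's html.parser to avoid extra dependencies; the RowCollector class is a hypothetical helper, and the sample is only the first five cells of the first row from the question.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <td> in every <tr>, keeping empty cells."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Join fragments and collapse whitespace; an empty cell becomes "".
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


html = """<tr valign="top">
<td class="doprawej bezlewej">
AT00BUWOG001
</td>
<td class="doprawej">
P
</td>
<td class="doprawej">
</td>
<td class="doprawej">
142
</td>
<td class="doprawej">
<b>BUWOG</b>
</td>
</tr>"""

collector = RowCollector()
collector.feed(html)
collector.close()
rows = collector.rows
print(rows)
```

The resulting list of lists can be passed to pd.DataFrame(rows) if you want a DataFrame at the end.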
Here is what I tried:
html="""<table><tr valign="top">
<td class="doprawej bezlewej">
AT00BUWOG001
</td>
<td class="doprawej">
P
</td>
<td class="doprawej">
</td>
<td class="doprawej">
142
</td>
<td class="doprawej">
<b>BUWOG</b>
</td>
<td class="doprawej">
124 184 779
</td>
<td class="doprawej">
16 019,84
</td>
<td class="doprawej">
12 476,29
</td>
<td class="doprawej">
2018-07-31
</td>
<td class="doprawej">
H
</td>
<td class="doprawej">
1,28
</td>
<td class="doprawej">
14,00
</td>
<td class="doprawej bezprawej">
2,30
</td>
</tr>
<tr valign="top">
<td class="doprawej bezlewej">
PLBRSTM00015
</td>
<td class="doprawej">
P
</td>
<td class="doprawej">
LA
</td>
<td class="doprawej">
180
</td>
<td class="doprawej">
<b>CALATRAVA</b>
</td>
<td class="doprawej">
15 000 000
</td>
<td class="doprawej">
3,45
</td>
<td class="doprawej">
7,93
</td>
<td class="doprawej">
2017-03-31
</td>
<td class="doprawej">
H
</td>
<td class="doprawej">
0,44
</td>
<td class="doprawej">
0,00
</td>
<td class="doprawej bezprawej">
0,00
</td>
</tr></table>"""
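from io import StringIO
import pandas as pd

# Recent pandas versions deprecate passing a literal HTML string, so wrap it in StringIO.
df = pd.read_html(StringIO(html), header=None)[0]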
print(df)
The output was:
0 1 2 3 4 5 6 7 \
0 AT00BUWOG001 P NaN 142 BUWOG 124 184 779 16 019,84 12 476,29
1 PLBRSTM00015 P LA 180 CALATRAVA 15 000 000 345 793
8 9 10 11 12
0 2018-07-31 H 128 1400 230
1 2017-03-31 H 44 0 0
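Note that in the output above, values such as 3,45 and 1,28 came back as 345 and 128, because read_html treats the comma as a thousands separator by default. A sketch of one way around it, assuming the page uses a space for thousands and a comma for decimals as the sample suggests: pass the thousands and decimal arguments of read_html. The table here is a shortened, hypothetical version of the rows above, and read_html needs an HTML parser such as lxml installed.

```python
from io import StringIO

import pandas as pd

# Shortened sample: ISIN, segment (empty in the first row), and three numbers
# written with a space as thousands separator and a comma as decimal mark.
html = """<table>
<tr><td>AT00BUWOG001</td><td></td><td>124 184 779</td><td>16 019,84</td><td>1,28</td></tr>
<tr><td>PLBRSTM00015</td><td>LA</td><td>15 000 000</td><td>3,45</td><td>0,44</td></tr>
</table>"""

# thousands=" " strips the grouping spaces; decimal="," reads the comma
# as the decimal point instead of dropping it.
df = pd.read_html(StringIO(html), header=None, thousands=" ", decimal=",")[0]

print(df)
print(df.dtypes)
```

With these arguments the numeric columns should be parsed as numbers, and the empty cell still shows up as NaN in its own column.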
Upvotes: 1