Pavan Suvarna

Reputation: 501

How to read td contents from an HTML page and convert them to a DataFrame

Below is my content:

`'<tablecellspacing="0"cellpadding="4"rules="all"id="DataGrid1"style="background-color:White;border-color:#3366CC;border-width:1px;border-style:None;height:65px;width:268px;border-collapse:collapse;"><trstyle="color:#CCCCFF;background-color:#003399;font-weight:bold;"><td>State</td><td>Centre</td><td>Variety</td><td>Unit</td><td>03/01/2020</td><td>10/01/2020</td><td>17/01/2020</td><td>24/01/2020</td><td>31/01/2020</td><td>07/02/2020</td><td>14/02/2020</td><td>21/02/2020</td><td>28/02/2020</td><td>06/03/2020</td><td>13/03/2020</td><td>20/03/2020</td><td>27/03/2020</td><td>03/04/2020</td><td>10/04/2020</td><td>17/04/2020</td></tr><trstyle="color:#003399;background-color:White;white-space:nowrap;"><tdstyle="background-color:#3CFFCE;font-weight:bold;">Apple</td><td></td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td></tr><trstyle="color:#003399;background-color:White;white-space:nowrap;"><td>AndhraPradesh</td><td>Chittoor</td><td>Deliciousmediumsize</td><td>Kg.</td><td>130.00</td><td>130.00</td><td>140.00</td><td>140.00</td><td>140.00</td><td>140.00</td><td>150.00</td><td>150.00</td><td>140.00</td><td>150.00</td><td>150.00</td><td>150.00</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td></tr><trstyle="color:#003399;background-color:White;white-space:nowrap;"><td>AndhraPradesh</td><td>Guntur</td><td>Deliciousmediumsize</td><td>Kg.</td><td>100.00</td><td>100.00</td><td>100.00</td><td>110.00</td><td>110.00</td><td>120.00</td><td>120.00</td><td>120.00</td><td>120.00</td><td>120.00</td><td>120.00</td><td>120.00</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td></tr><trstyle="color:#003399;background-color:White;white-space:nowrap;"><td>AndhraPradesh</td><td>Kurnool</td><td>Deliciousmediumsize</td><td>Kg.</td><td>110.00</td><td>11
0.00</td><td>120.00</td><td>120.00</td><td>120.00</td><td>120.00</td><td>120.00</td><td>120.00</td><td>120.00</td><td>120.00</td><td>120.00</td><td>120.00</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td></tr></table>`'

When I print td.text I get the cell values, but I am not sure how to put them into a pandas DataFrame:

from bs4 import BeautifulSoup

soup = BeautifulSoup(Content, 'lxml')
TD = soup.findAll('td')

for td in TD:
    print(td.text)

State
Centre
Variety
Unit
03/01/2020
10/01/2020
17/01/2020

AndhraPradesh
Chittoor
Deliciousmediumsize
Kg.
130.00
130.00
140.00
 
AndhraPradesh
Guntur
Deliciousmediumsize
Kg.
100.00
100.00
100.00
 
AndhraPradesh
Kurnool
Deliciousmediumsize
Kg.
110.00
110.00
120.00

My expected outcome (sample):

Date            state        center         variety            Unit  value
03/01/2020  AndhraPradesh   Chittoor       Deliciousmediumsize  kg    130
10/01/2020  AndhraPradesh   Guntur         Deliciousmediumsize  kg    100
17/01/2020  AndhraPradesh   Kurnool        Deliciousmediumsize  kg    110

Can anyone help me with this?

Upvotes: 1

Views: 101

Answers (1)

Kunal Sawant

Reputation: 493

In [149]: Content = Content.replace( 'trstyle' , 'tr style')

In [150]: soup = BeautifulSoup(Content, 'lxml')
     ...: TR = soup.findAll('tr')

In [151]: rownum = 0

In [152]: cols, data = [] , []
     ...: for tr in TR:
     ...:     if rownum == 0:
     ...:         for td in tr.findAll('td'):
     ...:             if td.text: cols.append(td.text)
     ...:
     ...:     else:
     ...:         row_data = []
     ...:         for td in tr.findAll('td'):
     ...:             if td.text:
     ...:                 row_data.append(td.text)
     ...:
     ...:         if row_data: data.append(row_data)
     ...:     rownum+=1
     ...:     print('row complete' )
     ...:
row complete
row complete
row complete
row complete
row complete

In [153]: data = data[1:]  # drop the first body row: the 'Apple' category row, which carries no price data
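
Slicing off the first row by position works for this snippet, but it depends on the category row ("Apple") being exactly the first one. If the full page has more such rows (an assumption, since only one is shown here), a filter on the cell contents may be safer. A minimal sketch, where `drop_category_rows` is a hypothetical helper name and the rows are lists of cell strings as collected above:

```python
def drop_category_rows(rows):
    """Keep only rows that have some non-blank cell after the first one.

    '&nbsp;' is parsed into the non-breaking space '\xa0', which is a
    non-empty (truthy) string, so a plain `if cell:` test does not filter it.
    str.strip() removes it, because '\xa0' counts as whitespace.
    """
    return [row for row in rows if any(cell.strip() for cell in row[1:])]


# Shortened illustrative rows, not the full table.
rows = [
    ["Apple", "", "\xa0", "\xa0"],                    # category row, no data
    ["AndhraPradesh", "Chittoor", "130.00", "\xa0"],
    ["AndhraPradesh", "Guntur", "100.00", "\xa0"],
]
data = drop_category_rows(rows)
```

With this, `data` keeps the Chittoor and Guntur rows and drops the category row wherever it appears.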



In [154]: df = pd.DataFrame( data = data , columns = cols )

In [155]: id_vars = ['State', 'Centre' , 'Variety' , 'Unit']

In [156]: value_vars = list(set(df.columns ) - set( id_vars) )

In [157]: pd.melt(df, id_vars = id_vars , value_vars = value_vars, var_name = 'Date' )
Out[157]:
            State    Centre              Variety Unit        Date   value
0   AndhraPradesh  Chittoor  Deliciousmediumsize  Kg.  28/02/2020  140.00
1   AndhraPradesh    Guntur  Deliciousmediumsize  Kg.  28/02/2020  120.00
2   AndhraPradesh   Kurnool  Deliciousmediumsize  Kg.  28/02/2020  120.00
3   AndhraPradesh  Chittoor  Deliciousmediumsize  Kg.  10/01/2020  130.00
4   AndhraPradesh    Guntur  Deliciousmediumsize  Kg.  10/01/2020  100.00
5   AndhraPradesh   Kurnool  Deliciousmediumsize  Kg.  10/01/2020  110.00
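
A more defensive variant, offered only as a sketch. The `replace` call above repairs `<trstyle`, but `<tablecellspacing` and `<tdstyle` in the posted content are fused in the same way, and building `value_vars` from a `set` loses the original date order. The code below uses only the standard library (`re` and `html.parser`) in place of BeautifulSoup, restores the missing space for all three tags, and emits one record per non-blank value with the dates in header order. The names `repair_tags`, `RowCollector` and `table_to_long` are hypothetical, and the code assumes the first four columns are identifiers and the remaining ones are dates.

```python
import re
from html.parser import HTMLParser


def repair_tags(markup):
    """Re-insert the space lost between a tag name and its first attribute."""
    # '<trstyle=' -> '<tr style=', '<tdstyle=' -> '<td style=', and so on.
    return re.sub(r"<(table|tr|td)(?=[a-z])", r"<\1 ", markup)


class RowCollector(HTMLParser):
    """Collect the text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()  # character references such as &nbsp; are decoded
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
            self._in_cell = False
        elif tag == "td" and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def table_to_long(markup, n_id=4):
    """Return one dict per (row, date) pair that has a non-blank value."""
    parser = RowCollector()
    parser.feed(repair_tags(markup))
    parser.close()
    # strip() also removes '\xa0', so '&nbsp;' cells become empty strings.
    header, *body = [[cell.strip() for cell in row] for row in parser.rows]
    id_cols, dates = header[:n_id], header[n_id:]
    records = []
    for row in body:
        ids = dict(zip(id_cols, row[:n_id]))
        for date, value in zip(dates, row[n_id:]):
            if value:  # skips blank cells and whole category rows ("Apple")
                records.append({**ids, "Date": date, "value": float(value)})
    return records


# Trimmed-down sample in the same fused style as the posted content.
sample = (
    '<tablecellspacing="0">'
    '<trstyle="font-weight:bold;"><td>State</td><td>Centre</td><td>Variety</td>'
    '<td>Unit</td><td>03/01/2020</td><td>10/01/2020</td><td>17/01/2020</td></tr>'
    '<trstyle="white-space:nowrap;"><tdstyle="font-weight:bold;">Apple</td><td></td>'
    '<td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td></tr>'
    '<tr><td>AndhraPradesh</td><td>Chittoor</td><td>Deliciousmediumsize</td>'
    '<td>Kg.</td><td>130.00</td><td>130.00</td><td>&nbsp;</td></tr>'
    '</table>'
)
records = table_to_long(sample)
```

The resulting list of dicts can be passed straight to `pd.DataFrame(records)`. Records come out row by row (all dates for one centre, then the next centre), so sort on `Date` afterwards if a date-major layout is preferred.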

Upvotes: 2
