Karn Kumar
Karn Kumar

Reputation: 8816

How Pandas read_html can get only selected columns from entire DataFrame

I'm trying to extract the particular columns from the html page where my html data looks like below.

1) HTML DATA Format

            VM Name           User Name        Image Name                           Network  VCPUS  Memory(GB)  Disk(GB) Tenant     Region      KVM Host Power State                          URL               Created
0      dbsw-powerbi  [email protected]           unknown   {u'VLAN181': [u'192.168.57.91']}      4          16       100    APP  DBS-AP-IN  dbs-appkvm03          On  https://compute.ezy.com  2018-08-02T10:30:07Z
1           pciedip  [email protected]     dbsVDI-RHEL65   {u'VLAN181': [u'192.168.57.37']}      4          32       200    APP  DBS-AP-IN  dbs-appkvm01          On  https://compute.ezy.com  2018-04-18T06:39:38Z
2  dbs-spbdatasync1  [email protected]    dbsVDI-RHEL510  {u'VLAN181': [u'192.168.57.156']}      1           8        50    APP  DBS-AP-IN     dbs-kvm13          On  https://compute.ezy.com  2018-04-05T09:51:29Z
3      dbsw-russian  [email protected]  dbsVDI-WIN764-V1  {u'VLAN181': [u'192.168.57.216']}      1           4       100    APP  DBS-AP-IN  dbs-appkvm01          On  https://compute.ezy.com  2018-04-02T06:25:25Z
4   dbs-spbdatasync  [email protected]    dbsVDI-RHEL510  {u'VLAN181': [u'192.168.57.233']}      1           8        50    APP  DBS-AP-IN     dbs-kvm13          On  https://compute.ezy.com  2018-04-02T05:03:03Z

I'm simply trying pandas read_html to get the DataFrame but unable to get the understanding to get the particular columns from the DataFrame. I need to selected columns ['VM Name', 'User Name', 'Network', 'Region'] out of the 13 column.

2) code snippet

from __future__ import print_function
from signal import signal, SIGPIPE, SIG_DFL
signal(SIGPIPE,SIG_DFL)
import pandas as pd
##### Python pandas, widen output display to see more columns. ####
pd.set_option('display.height', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('expand_frame_repr', True)

# print(pd.read_excel('ssd.xlsx'))
# Data = pd.read_html('http://openstacksearch/vm_list.html', header=0, flavor='bs4', index_col=['VM Name', 'User Name', 'Network', 'Region'])
Data = pd.read_html('http://openstacksearch/vm_list.html', header=0, flavor='bs4')
print(Data[0].head())

Upvotes: 1

Views: 3983

Answers (2)

Karn Kumar
Karn Kumar

Reputation: 8816

I got solution while selecting the DataFrame from the processed read_html and then choosing the desired columns with multi-index based approach. Thanks to the Adrew for driving ideas around this..

So, code looks like below ... may be helpful for someone

import pandas as pd
##### Python pandas, widen output display to see more columns. ####
pd.set_option('display.height', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('expand_frame_repr', True)
###### Data Extraction ##################
'''
pd.read_html returns you a list with one element and that 
element is the pandas dataframe, i.e.
Data = pd.read_html('url') will produce a list
Data[0]  Will return a pandas DataFrame
'''
Data = pd.read_html('http://openstacksearch/vm_list.html', header=0, flavor='bs4')[0]
Data1 = Data[['VM Name', 'User Name', 'Network', 'Region']]
print(Data1)

Upvotes: 3

Andrew
Andrew

Reputation: 970

to select a subset of columns you could use

Data = pd.read_html('http://openstacksearch/vm_list.html', header=0, flavor='bs4')
Data = Data[['VM Name', 'User Name', 'Network', 'Region']]

Upvotes: 1

Related Questions