user2981952

Reputation: 41

Extract data from a website with pandas read_html

I'm trying to extract data from a website URL.

The table contains span tags that interfere with the extraction: each cell value is concatenated with the text of its span tag. I want to extract the cell content and the span content into separate cells. Any help would be greatly appreciated.

Here is the code:

import pandas as pd

url = "https://www.sqimway.com/lte_band.php"

lte_band = pd.read_html(url)

lte_band[0]


Upvotes: 1

Views: 148

Answers (1)

Mark Moretto

Reputation: 2348

If you have pandas 0.24+, you can use pandas.MultiIndex.to_flat_index() to turn each multi-level column label into a tuple, then join the unique values of each tuple into a single column name.
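To see what this does to a single header, here is a minimal sketch that runs without pandas. The tuples are hand-written stand-ins for the kind of values to_flat_index() yields (they are assumptions, not read from the site), and flatten_label is a hypothetical helper name. It uses dict.fromkeys as one way to drop repeated level names while keeping their order.

```python
def flatten_label(levels):
    """Join the distinct level names of one column, keeping first-seen order."""
    # dict.fromkeys drops repeats while preserving insertion order (Python 3.7+)
    return " ".join(dict.fromkeys(levels))


# Hand-written stand-ins for the tuples that df.columns.to_flat_index() yields
sample_columns = [
    ("Band", "Band"),
    ("Downlink (MHz)", "Low Earfcn"),
    ("Channel bandwidth (MHz)", "1.4"),
]

flat_names = [flatten_label(col) for col in sample_columns]
print(flat_names)
```

A header that spans both rows, such as ("Band", "Band"), collapses to a single word, while a two-level header keeps both parts.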

# Set a new DataFrame variable.
df = lte_band[0]

# Note: set() does not preserve order, so sort the unique values by their
# position in the original tuple to retain the header order.
df.columns = [" ".join(sorted(set(q), key=q.index)) for q in df.columns.to_flat_index()]

Output of df.columns:

Index(['Band', 'Name', 'Mode', 'Downlink (MHz) Low Earfcn',
       'Downlink (MHz) Middle Earfcn', 'Downlink (MHz) High Earfcn',
       'BandwidthDL/UL (MHz)', 'Uplink (MHz) Low Earfcn',
       'Uplink (MHz) Middle Earfcn', 'Uplink (MHz) High Earfcn',
       'Duplex spacing(MHz)', 'Geographicalarea', '3GPPrelease',
       'Channel bandwidth (MHz) 1.4', 'Channel bandwidth (MHz) 3',
       'Channel bandwidth (MHz) 5', 'Channel bandwidth (MHz) 10',
       'Channel bandwidth (MHz) 15', 'Channel bandwidth (MHz) 20'],
      dtype='object')

Formatted:

Band
Name
Mode
Downlink (MHz) Low Earfcn
Downlink (MHz) Middle Earfcn
Downlink (MHz) High Earfcn
BandwidthDL/UL (MHz)
Uplink (MHz) Low Earfcn
Uplink (MHz) Middle Earfcn
Uplink (MHz) High Earfcn
Duplex spacing(MHz)
Geographicalarea
3GPPrelease
Channel bandwidth (MHz) 1.4
Channel bandwidth (MHz) 3
Channel bandwidth (MHz) 5
Channel bandwidth (MHz) 10
Channel bandwidth (MHz) 15
Channel bandwidth (MHz) 20
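The renaming above only tidies the headers. For the span issue raised in the question, one possible approach is to parse the cells yourself and keep the span text apart from the rest of the cell. The following is only a sketch using the standard-library html.parser: the HTML fragment, its values, and the class name CellSpanParser are made up for illustration, and the real page's markup may differ. It also assumes every td sits inside a tr.

```python
from html.parser import HTMLParser


class CellSpanParser(HTMLParser):
    """Collect a (cell_text, span_text) pair for every <td> in the markup."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of (cell, span) pairs
        self._row = None        # row currently being built
        self._cell = None       # text found directly inside the current <td>
        self._span = None       # text found inside a <span> within that <td>
        self._in_span = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._cell, self._span = [], []
        elif tag == "span" and self._cell is not None:
            self._in_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_span = False
        elif tag == "td" and self._cell is not None:
            pair = ("".join(self._cell).strip(), "".join(self._span).strip())
            self._row.append(pair)  # assumes the <td> is inside a <tr>
            self._cell = self._span = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is None:
            return  # text outside any <td> is ignored
        (self._span if self._in_span else self._cell).append(data)


# Made-up fragment; the real table structure is an assumption
html = (
    "<table>"
    "<tr><td>1<span>FDD</span></td><td>2100</td></tr>"
    "<tr><td>33<span>TDD</span></td><td>1900</td></tr>"
    "</table>"
)

parser = CellSpanParser()
parser.feed(html)
print(parser.rows)
```

Each pair could then be written to two adjacent DataFrame columns, so the cell value and the span text no longer run together.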

Upvotes: 1
