Gabriela M

Reputation: 615

Read url as pandas dataframe with column names (python3)

I have read several questions regarding this topic, but nothing seems to work for me.

I want to retrieve the data from this page "http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat" with certain names for the columns.

My code is the following, which does not let me assign names to the columns of the data, because everything is in a single column:

import pandas as pd
import io
import requests
url="http://archive.ics.uci.edu/ml/machine-learningdatabases/statlog/heart/heart.dat"
s=requests.get(url).content
header_row = ['age','sex','chestpain','restBP','chol','sugar','ecg','maxhr','angina','dep','exercise','fluor','thal','diagnosis']
c=pd.read_csv(io.StringIO(s.decode('utf-8')), names=header_row)
print(c)

The output is:

     age  sex  chestpain  \
0    70.0 1.0 4.0 130.0 322.0 0.0 2.0 109.0 0.0 2.4...  NaN        NaN   
1    67.0 0.0 3.0 115.0 564.0 0.0 2.0 160.0 0.0 1.6...  NaN        NaN   
2    57.0 1.0 2.0 124.0 261.0 0.0 0.0 141.0 0.0 0.3...  NaN        NaN   
3    64.0 1.0 4.0 128.0 263.0 0.0 0.0 105.0 1.0 0.2...  NaN        NaN

What do I have to do to achieve my goal?

Thank you very much!!!

Upvotes: 1

Views: 3007

Answers (1)

Clock Slave

Reputation: 7967

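The URL in your code is missing a hyphen ("machine-learningdatabases" should be "machine-learning-databases"); I've corrected that in my answer. Everything ends up in a single column because read_csv splits on commas by default, while this file separates values with spaces. One way around this is to parse the text yourself: decode the response bytes as UTF-8, split the resulting string into lines to get each row, and then split each row on whitespace to get the individual values. This gives you a nested list representation of the data set, which you can convert to a pandas DataFrame while assigning the column names.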

import pandas as pd
import io
import requests
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat"
s = requests.get(url).content
s = s.decode('utf-8')
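# Split into rows, skipping blank lines (a trailing newline would otherwise add an empty row)
s_rows = s.splitlines()
s_rows_cols = [each.split() for each in s_rows if each.strip()]
header_row = ['age','sex','chestpain','restBP','chol','sugar','ecg','maxhr','angina','dep','exercise','fluor','thal','diagnosis']
# The split values are strings, so convert them to numbers
c = pd.DataFrame(s_rows_cols, columns = header_row).astype(float)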
c.head()
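
As a possible alternative to splitting the text by hand, read_csv can be told that the separator is whitespace. The sketch below assumes the file has no header line and exactly 14 whitespace-separated values per row. The helper name read_heart is hypothetical, and the two sample rows are only illustrative: their first ten values come from the output in the question, and the last four are placeholders, not real data.

```python
import io
import pandas as pd

header_row = ['age', 'sex', 'chestpain', 'restBP', 'chol', 'sugar', 'ecg',
              'maxhr', 'angina', 'dep', 'exercise', 'fluor', 'thal', 'diagnosis']

def read_heart(source):
    """Parse whitespace-delimited rows into a DataFrame with named columns.

    `source` may be a URL, a file path, or a file-like object.
    (read_heart is a hypothetical helper name used for illustration.)
    """
    # sep=r'\s+' treats any run of whitespace as one delimiter;
    # header=None says the file itself carries no column names.
    return pd.read_csv(source, sep=r'\s+', header=None, names=header_row)

# Illustrative rows only: the last four values in each row are placeholders.
sample = (
    "70.0 1.0 4.0 130.0 322.0 0.0 2.0 109.0 0.0 2.4 0.0 0.0 0.0 0\n"
    "67.0 0.0 3.0 115.0 564.0 0.0 2.0 160.0 0.0 1.6 0.0 0.0 0.0 0\n"
)
df = read_heart(io.StringIO(sample))
print(df.head())
```

With the real data you would pass the corrected URL (or io.StringIO(s) from the code above) to read_heart instead of the sample text. The columns then come back as numbers rather than strings, so no separate conversion step should be needed.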

Upvotes: 1
