Reputation: 275
I'd like to import a dataset from the UCI repository as a pandas data frame.
The problem is that the bulk of data sits in one file (wdbc.data) and the column names in another (wdbc.names), and I don't know how to read them in together as a single pandas data frame.
Thanks for any help!
import pandas as pd
df1 = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data')
df2 = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names')
df_final = df1.append(df2)
ERROR MESSAGE:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 7, saw 3
Upvotes: 0
Views: 736
Reputation: 11
The "wdbc.names" file is a plain-text description of the dataset, not a CSV, so it can't be parsed into column names directly. I took the column names from the Kaggle copy of the same dataset instead: https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data
You can then try something like this:
import pandas as pd
names = ['id','diagnosis','radius_mean','texture_mean','perimeter_mean','area_mean','smoothness_mean','compactness_mean','concavity_mean','concave points_mean','symmetry_mean','fractal_dimension_mean','radius_se','texture_se','perimeter_se','area_se','smoothness_se','compactness_se','concavity_se','concave points_se','symmetry_se','fractal_dimension_se','radius_worst','texture_worst','perimeter_worst','area_worst','smoothness_worst','compactness_worst','concavity_worst','concave points_worst','symmetry_worst','fractal_dimension_worst']
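# wdbc.data has no header row: header=None keeps the first record as data, names= assigns the labels
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None, names=names)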
df.head()
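If you'd rather not type the 32 names by hand, here is a minimal sketch that generates them. It assumes the layout described in wdbc.names (an ID, a diagnosis, then ten measurements repeated as mean, standard error, and "worst") and reuses the Kaggle-style labels from the list above. The helper names `build_wdbc_columns` and `load_wdbc` are made up for illustration; they are not part of pandas or the dataset.

```python
import pandas as pd

# The ten per-nucleus measurements, in the order used by the list above.
FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Each measurement appears three times: mean, standard error, and "worst".
STATS = ["mean", "se", "worst"]


def build_wdbc_columns():
    """Return 32 names: id, diagnosis, then 10 features for each of 3 stats."""
    # Hypothetical helper: all means come first, then all SEs, then all worsts.
    return ["id", "diagnosis"] + [
        f"{feat}_{stat}" for stat in STATS for feat in FEATURES
    ]


def load_wdbc(source):
    """Read wdbc.data from a URL, local path, or file-like object."""
    # header=None stops pandas from consuming the first record as a header.
    return pd.read_csv(source, header=None, names=build_wdbc_columns())
```

Calling `load_wdbc(...)` with the wdbc.data URL from the question should give the same frame as the code above. Generating the names mostly guards against typos in a long hand-written list.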
Upvotes: 1