Reputation: 2827
I'm trying to use scikit-learn with data stored in a spreadsheet (.xlsx). My plan is to read the spreadsheet with Pandas and then convert the result to a numpy array that scikit-learn can use.
The problem is that when I convert my DataFrame to numpy, I lose almost all the data! I think it is because the sheet doesn't have column names, just raw data. Example:
28.7967 16.0021 2.6449 0.3918 0.1982
31.6036 11.7235 2.5185 0.5303 0.3773
162.052 136.031 4.0612 0.0374 0.0187
My code so far:
import numpy as np
import pandas as pd

def split_data():
    test_data = pd.read_excel('magic04.xlsx', sheetname=0, skip_footer=16020)
    # The line below prints the data correctly
    print(test_data.iloc[:, 0:10])
    # Neither of the lines below works as expected
    test1 = np.array(test_data.iloc[:, 0:10])
    test2 = test_data.as_matrix()
I'm really lost here. Any help would be very welcome...
Upvotes: 2
Views: 86
Reputation: 76297
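I'd suggest that you use header=None in read_excel. By default, read_excel treats the first row of the sheet as the column names, so your first row of data ends up as the header instead of as data. See the following: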
df = pd.read_excel('stuff.xlsx')
>> df
    28.7967   16.0021  2.6449  0.3918  0.1982
0   31.6036   11.7235  2.5185  0.5303  0.3773
1  162.0520  136.0310  4.0612  0.0374  0.0187
>> df.ix[:, 1:2]
0
1
Versus:
df = pd.read_excel('stuff.xlsx', header=None)
>> df
          0         1       2       3       4
0   28.7967   16.0021  2.6449  0.3918  0.1982
1   31.6036   11.7235  2.5185  0.5303  0.3773
2  162.0520  136.0310  4.0612  0.0374  0.0187
>> df.ix[:, 1:2]
          1       2
0   16.0021  2.6449
1   11.7235  2.5185
2  136.0310  4.0612
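For anyone on a recent pandas, here is a minimal sketch of the same comparison that ends in a numpy array. A few assumptions: read_csv on an in-memory string stands in for read_excel so the example runs without an .xlsx file (header=None has the same meaning in both readers), the names RAW and load are hypothetical helpers made up for illustration, and to_numpy() and iloc are used because as_matrix() and .ix were removed in pandas 1.0.

```python
import io

import numpy as np
import pandas as pd

# The three sample rows from the question, as headerless text.
RAW = """28.7967 16.0021 2.6449 0.3918 0.1982
31.6036 11.7235 2.5185 0.5303 0.3773
162.052 136.031 4.0612 0.0374 0.0187
"""


def load(header):
    # Hypothetical helper: read_csv stands in for read_excel here
    # (an assumption made only to keep the sketch self-contained).
    return pd.read_csv(io.StringIO(RAW), sep=r"\s+", header=header)


with_default = load("infer")  # first row is consumed as column labels
without_header = load(None)   # columns are labelled 0..4, every row is data

X_bad = with_default.to_numpy()  # one row short
X = without_header.to_numpy()    # full data, ready for scikit-learn

print(X_bad.shape)
print(X.shape)

# Positional slice: columns 1 and 2, like df.ix[:, 1:2] above.
print(without_header.iloc[:, 1:3].to_numpy())
```

With the real spreadsheet, the equivalent call would presumably be pd.read_excel('magic04.xlsx', header=None).to_numpy(), plus whatever sheet and footer options you need.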
Upvotes: 2