J. Cole

Reputation: 47

Loading CSV with pandas - arrays not parsed correctly

I have a dataset which I converted to CSV as potential input for a Keras autoencoder. Loading the CSV with pandas.read_csv() works flawlessly, but the data types are not correct.

The CSV contains only two columns, label and features: the label column contains strings and the features column contains arrays of signed integers ([-1, 1]). So in general a pretty simple structure.

To get two different dataframes for further processing I created them via:

labels = pd.DataFrame(columns=['label'], data=csv_data, dtype='U') and

features = pd.DataFrame(columns=['features'], data=csv_data)

In both cases I got the wrong data types, as both are marked as object-typed dataframes. What am I doing wrong? For the features it is even harder, because the parsing returns a pandas Series that contains the array as a string: ['[1, ..., 1]'].

So I tried a tedious workaround: parsing the string back to a NumPy array via .to_numpy(), a Python cast for every element, and then np.asarray(), but the dtype of the dataframe is still incorrect. I do not think this can be the general approach to solving this task. As I am fairly new to pandas I checked some tutorials and the API, but in most cases a cell in a dataframe contains a single value rather than a complete array. Maybe my overall design of the dataframe is just not suitable for this task.

Any help appreciated!

Upvotes: 0

Views: 607

Answers (2)

J. Cole

Reputation: 47

The input CSV was formatted incorrectly, so the parsing was accurate but not what I intended. I expanded the real columns and skipped the header so that there is a column for every array entry. Now pandas recognizes the types and the correct dimensions.
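
For illustration, here is a minimal sketch of what such a "wide" file might look like when loaded. The file content is made up, and the layout (no header row, label in the first column, one column per array entry) is an assumption about what the fix above amounts to, not the exact file used:

```python
import io

import pandas as pd

# Hypothetical wide-format CSV: no header row, the label first,
# then one column per array entry.
csv_text = "a,1,-1,1\nb,-1,-1,1\n"

# header=None tells pandas the first line is data, not column names.
df = pd.read_csv(io.StringIO(csv_text), header=None)

labels = df.iloc[:, 0]     # first column: the label strings
features = df.iloc[:, 1:]  # remaining columns: inferred as integers

print(features.dtypes)
print(features.to_numpy().shape)
```

With this layout each cell holds a single value, so pandas can infer an integer dtype per column, and features.to_numpy() gives a 2-D array.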

Upvotes: 0

Equinox

Reputation: 6748

You are reading the file as strings, but the column holds Python lists, so you need to evaluate each string to get the list back. I am not sure of the use case, but you can also split the labels into separate columns for a more readable dataframe:

import ast
import pandas as pd

features = ["featurea", "featureb", "featurec", "featured", "featuree"]
labels = ["[1,0,1,1,1,1]", "[1,0,1,1,1,1]", "[1,0,1,1,1,1]", "[1,0,1,1,1,1]", "[1,0,1,1,1,1]"]

df = pd.DataFrame(list(zip(features, labels)),
                  columns=['Features', 'Labels'])

# Convert the strings to lists
df['Labels'] = df['Labels'].map(ast.literal_eval)
df.index = df['Features']

# Since the lists themselves might not be useful, split and expand them into multiple columns
new_df = pd.DataFrame(df['Labels'].values.tolist(), index=df.index)

Output

          0  1  2  3  4  5
Features
featurea  1  0  1  1  1  1
featureb  1  0  1  1  1  1
featurec  1  0  1  1  1  1
featured  1  0  1  1  1  1
featuree  1  0  1  1  1  1
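
The same conversion can also be done while reading the file. This is only a sketch: the CSV content is made up, and it assumes the file has the label and features columns from the question, with each array quoted so that its commas are not treated as separators:

```python
import ast
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content standing in for the real file.
csv_text = 'label,features\n"a","[1, -1, 1]"\n"b","[-1, -1, 1]"\n'

# converters applies ast.literal_eval to every cell of the 'features'
# column during parsing, so the cells hold Python lists, not strings.
df = pd.read_csv(io.StringIO(csv_text), converters={"features": ast.literal_eval})

# Build one 2-D integer array (rows x array length) from the per-row lists.
X = np.array(df["features"].tolist())
labels = df["label"].to_numpy()

print(X.shape, X.dtype)
```

The features column itself still has dtype object, because each cell holds a list. The numeric dtype only appears once the lists are combined into a single array such as X, which is probably the more convenient form for Keras anyway.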

Upvotes: 1
