everial

Reputation: 312

Setting pandas.DataFrame string dtype (not file based)

I'm having trouble with the dtype argument of the pandas.DataFrame constructor. I'd like to preserve string values, but the following snippets always seem to convert to a numeric type and yield a frame full of NaNs.

from __future__ import unicode_literals
from __future__ import print_function


import numpy as np
import pandas as pd


def main():
    columns = ['great', 'good', 'average', 'bad', 'horrible']
    # minimal example, dates are coming (as strings) from some
    # non-file source.
    example_data = {
        'alice': ['', '', '', '2016-05-24', ''],
        'bob': ['', '2015-01-02', '', '', '2012-09-15'],
        'eve': ['2011-12-31', '', '1998-08-13', '', ''],
    }

    # first pass, yields dataframe full of NaNs
    df = pd.DataFrame(data=example_data, index=example_data.keys(),
        columns=columns, dtype=str) #or string, 'str', 'string', 'object'
    print(df.dtypes)
    print(df)
    print()

    # based on https://github.com/pydata/pandas/blob/master/pandas/core/frame.py
    # and https://github.com/pydata/pandas/blob/37f95cef85834207db0930e863341efb285e38a2/pandas/types/common.py
    # we're ultimately feeding dtype to numpy's dtype, so let's just use that:
    #     (using np.dtype('S10') and converting to str doesn't work either)
    df = pd.DataFrame(data=example_data, index=example_data.keys(),
        columns=columns, dtype=np.dtype('U'))
    print(df.dtypes)
    print(df) # still full of NaNs... =(



if __name__ == '__main__':
    main()

What value(s) of dtype will preserve strings in the data frame?

For reference:

$ python --version

2.7.12

$ pip2 list | grep pandas

pandas (0.18.1)

$ pip2 list | grep numpy

numpy (1.11.1)

Upvotes: 1

Views: 225

Answers (2)

Alicia Garcia-Raboso

Reputation: 13913

For the particular case in the OP, you can use the DataFrame.from_dict() constructor with orient='index' (see also the Alternate Constructors section of the DataFrame documentation).

from __future__ import unicode_literals
from __future__ import print_function

import pandas as pd

columns = ['great', 'good', 'average', 'bad', 'horrible']
example_data = {
    'alice': ['', '', '', '2016-05-24', ''],
    'bob': ['', '2015-01-02', '', '', '2012-09-15'],
    'eve': ['2011-12-31', '', '1998-08-13', '', ''],
}
df = pd.DataFrame.from_dict(example_data, orient='index')
df.columns = columns

print(df.dtypes)
# great       object
# good        object
# average     object
# bad         object
# horrible    object
# dtype: object

print(df)
#             great        good     average         bad    horrible
# bob                2015-01-02                          2012-09-15
# eve    2011-12-31              1998-08-13                        
# alice                                      2016-05-24     

You can even specify dtype=str in DataFrame.from_dict() — though it is not necessary in this example.
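
For completeness, here is a minimal sketch of the dtype=str variant. It assumes Python 3 and a reasonably recent pandas; the name df_str is only for illustration, and the exact dtype reported for the columns may differ between pandas versions.

```python
import pandas as pd

columns = ['great', 'good', 'average', 'bad', 'horrible']
example_data = {
    'alice': ['', '', '', '2016-05-24', ''],
    'bob': ['', '2015-01-02', '', '', '2012-09-15'],
    'eve': ['2011-12-31', '', '1998-08-13', '', ''],
}

# orient='index' makes each dict key a row label instead of a column label.
df_str = pd.DataFrame.from_dict(example_data, orient='index', dtype=str)
df_str.columns = columns

# Empty strings stay empty strings; nothing is turned into NaN.
print(df_str.loc['alice', 'bad'])
print(df_str.isnull().values.any())
```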

EDIT: The DataFrame constructor interprets a dictionary as a collection of columns:

print(pd.DataFrame(example_data))

#         alice         bob         eve
# 0                          2011-12-31
# 1              2015-01-02            
# 2                          1998-08-13
# 3  2016-05-24                        
# 4              2012-09-15            

(I'm dropping the data=, since data is the first argument in the function's signature anyway). Your code confuses rows and columns:

print(pd.DataFrame(example_data, index=example_data.keys(), columns=columns))

#       great good average  bad horrible
# alice   NaN  NaN     NaN  NaN      NaN
# bob     NaN  NaN     NaN  NaN      NaN
# eve     NaN  NaN     NaN  NaN      NaN   

(though I'm not exactly sure how it ends up giving you a DataFrame of NaNs). It would be correct to do

print(pd.DataFrame(example_data, columns=example_data.keys(), index=columns))

#                alice         bob         eve
# great                             2011-12-31
# good                  2015-01-02            
# average                           1998-08-13
# bad       2016-05-24                        
# horrible              2012-09-15   

Specifying the column names is actually unnecessary — they are already parsed from the dictionary:

print(pd.DataFrame(example_data, index=columns))

#                alice         bob         eve
# great                             2011-12-31
# good                  2015-01-02            
# average                           1998-08-13
# bad       2016-05-24                        
# horrible              2012-09-15                     

What you want is actually the transpose of this — so you can also take said transpose!

print(pd.DataFrame(data=example_data, index=columns).T)

#             great        good     average         bad    horrible
# alice                                      2016-05-24            
# bob                2015-01-02                          2012-09-15
# eve    2011-12-31              1998-08-13               
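
On the all-NaN frame from the original call: a likely explanation is that, when data is a dict, the columns argument selects among the dict's keys rather than renaming them. Any requested column that matches no key is created empty. The sketch below uses made-up toy data and names (data, partial, nothing) and assumes Python 3 with a recent pandas.

```python
import pandas as pd

data = {'alice': ['a', 'b'], 'bob': ['c', 'd']}

# 'alice' matches a dict key and is kept; 'missing' matches nothing,
# so that column is created and filled with NaN.
partial = pd.DataFrame(data, columns=['alice', 'missing'])

# No requested column matches a key, so every cell is NaN. The index here
# only supplies row labels, as in the question's call.
nothing = pd.DataFrame(data, index=['r1', 'r2', 'r3'], columns=['x', 'y'])

print(partial)
print(nothing)
```

So in the question the dtype argument is probably not the culprit: none of 'great', 'good', ... is a key of example_data, so no string data ever makes it into the frame.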

Upvotes: 1

AlvaroP

Reputation: 400

This is not a proper answer, but while you wait for one from someone else: I've noticed that everything works if you use the read_csv function.

So if you place your data in a .csv file called myData.csv, like this:

great,good,average,bad,horrible
alice,,,,2016-05-24,
bob,,2015-01-02,,,2012-09-15
eve,2011-12-31,,1998-08-13,,

and do

df = pd.read_csv('blablah/myData.csv')

it will keep the strings as they are!

            great        good     average         bad    horrible
alice         NaN         NaN         NaN  2016-05-24         NaN
bob           NaN  2015-01-02         NaN         NaN  2012-09-15
eve    2011-12-31         NaN  1998-08-13         NaN         NaN

If you want, the empty values can be written as a space in the CSV file, or as any other character/marker.
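
Since the question says the data is not file based, the same idea can be tried without touching disk. This is only a sketch, assuming Python 3 and a recent pandas: csv_text and df_csv are illustrative names, and the extra 'name' header field is my addition so the row labels have an explicit column. Passing keep_default_na=False keeps empty fields as empty strings rather than NaN.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV text; the 'name' header field is an addition
# so that the row labels have an explicit column.
csv_text = (
    "name,great,good,average,bad,horrible\n"
    "alice,,,,2016-05-24,\n"
    "bob,,2015-01-02,,,2012-09-15\n"
    "eve,2011-12-31,,1998-08-13,,\n"
)

df_csv = pd.read_csv(
    io.StringIO(csv_text),
    index_col='name',
    dtype=str,              # do not infer numeric or date types
    keep_default_na=False,  # keep empty fields as '' instead of NaN
)

print(df_csv)
```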

Upvotes: 0
