paulsef11

Reputation: 648

Why do I get Pandas data frame with only one column vs Series?

I've run into single-column data frames a couple of times, much to my chagrin (examples below), but in most other cases a one-column result would just be a Series. Is there any rhyme or reason as to why a one-column DF would be returned?

Examples:

1) when indexing columns by a boolean mask where the mask only has one true value:

import pandas as pd
df = pd.DataFrame([list('abc'), list('def')], columns=['foo', 'bar', 'tar'])
mask = [False, True, False]
type(df.loc[:, mask])  # .ix was removed in pandas 1.0; .loc accepts the same boolean mask

2) when setting an index on DataFrame that only has two columns to begin with:

df = pd.DataFrame([list('ab'), list('de'), list('fg')], columns=['foo', 'bar'])
type(df.set_index('foo'))

I feel like if I'm expecting a DF with only one column, I can deal with it by just calling

pd.Series(df.values.ravel(), index=df.index)  # .values is an attribute, not a method


Upvotes: 6

Views: 9936

Answers (1)

BrenBarn

Reputation: 251345

In general, a one-column DataFrame will be returned when the operation could return a multicolumn DataFrame. For instance, when you use a boolean column index, a multicolumn DataFrame would have to be returned if there was more than one True value, so a DataFrame will always be returned, even if it has only one column. Likewise when setting an index, if your DataFrame had more than two columns, the result would still have to be a DataFrame after removing one for the index, so it will still be a DataFrame even if it has only one column left.

In contrast, if you do something like df.loc[:, 'col'] (spelled df.ix[:,'col'] in older pandas, where .ix still existed), it returns a Series, because passing a single column name can never select more than one column.
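To make the contrast concrete, here is a minimal sketch using the frame from the question. It assumes a pandas version where `.loc` has replaced `.ix` and that the column names are unique; the variable names `by_mask`, `by_list`, and `by_label` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame([list("abc"), list("def")], columns=["foo", "bar", "tar"])

# A boolean mask could match any number of columns, so the result is a DataFrame.
mask = [False, True, False]
by_mask = df.loc[:, mask]

# A list of labels could also hold several names, so this is a DataFrame as well.
by_list = df[["bar"]]

# A single label names exactly one column (assuming unique column names),
# so the result is a Series.
by_label = df.loc[:, "bar"]

for name, result in [("mask", by_mask), ("list", by_list), ("label", by_label)]:
    print(name, type(result).__name__, result.shape)
```

The shape printed for the Series has one dimension, while both one-column DataFrames keep two.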

The idea is that doing an operation should not sometimes return a DataFrame and sometimes a Series based on features specific to the operands (i.e., how many columns they happen to have, how many values are True in your boolean mask). When you do df.set_index('col'), it's simpler if you know that you will always get a DataFrame, without having to worry about how many columns the original happened to have.
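One way to see this rule in action is to vary how many values in the mask are True and check the result type each time. This is just a sketch; the `masks` dictionary and its keys are made up for the example.

```python
import pandas as pd

df = pd.DataFrame([list("abc"), list("def")], columns=["foo", "bar", "tar"])

# Hypothetical masks selecting zero, one, and two columns.
masks = {
    "none": [False, False, False],
    "one": [False, True, False],
    "two": [True, False, True],
}

# The result type depends on the kind of indexer (a boolean mask),
# not on how many of its entries happen to be True.
results = {name: df.loc[:, m] for name, m in masks.items()}

for name, result in results.items():
    print(name, type(result).__name__, result.shape)
```

Every result is a DataFrame, so code that consumes it does not need a special case for the one-column situation.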

Note that there is also the DataFrame method .squeeze() for turning a one-column DataFrame into a Series.
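As a rough illustration of that last point, the sketch below applies `.squeeze()` to the `set_index` example from the question. Passing `axis="columns"` is a choice made here, not something the answer specifies; it restricts the squeeze to the column axis.

```python
import pandas as pd

df = pd.DataFrame([list("ab"), list("de"), list("fg")], columns=["foo", "bar"])

# set_index always returns a DataFrame; here only the 'bar' column is left.
one_col = df.set_index("foo")

# Squeeze only the column axis, turning the one-column frame into a Series.
as_series = one_col.squeeze(axis="columns")

# Selecting the only column by position gives the same Series.
same = one_col.iloc[:, 0]

print(type(one_col).__name__, type(as_series).__name__)
print(as_series.equals(same))
```

Calling `.squeeze()` with no arguments squeezes every length-1 axis, so a frame with one row and one column would come back as a scalar rather than a Series.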

Upvotes: 7
