Pandas: Filling up empty dataframe

Question

I have two questions. First, filling in the data at the end of my code triggers the error shown below. Second, since I am not very familiar with pandas, this code is probably not idiomatic. If you have any improvements, feel free to help make it more compact and efficient.

The code is supposed to create a crosswalk from x to y. The database may contain the same x<->y relationship several times, but each x should map to a single y. For every x, I check that the database is actually consistent: if there is more than one relation, they must all map to the same y.

Beginning of the crosswalk.csv:

x,y
832,"6231"
0,"00000000"
0,"00000000"
0,"00000000"
0,"00000000"
0,"00000000"
0,"00000000"
840,"6214"
842,"6111"

The code

data = pd.read_csv('data/crosswalk_short.csv')
df = pd.DataFrame(data)

xs = df.x.unique()
result = pd.DataFrame(index=xs)
result.fillna(NaN)

for x in xs:
    ys = df[df.x == x].y
    range = arange(0, len(ys.index))
    ys = ys.reindex(range)

    if (range[-1] > 0 and not isnan(ys[1]) ):
        print 'error!'

    result._ix[x] = ys[0]

The error:

  File "", line 1, in 
    result._ix[x] = ys[0]
TypeError: 'NoneType' object does not support item assignment

Phillip Cloud · Accepted Answer

Part 1

Anything with a single underscore as the first character of its name is generally "private", which in the pandas code base really means "subject to change". So you shouldn't be using _ix for anything. Use loc, iloc, [] syntax, or ix to perform assignment and to select subsets of your data (ix was deprecated and then removed in later pandas releases, so on a current version prefer loc and iloc). This error happens because _ix is not instantiated until you call ix (its value is None until that happens), but this implementation detail is irrelevant to you as a user of pandas. Use the public APIs and you usually won't get these kinds of errors.
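As a minimal sketch of assignment through the public indexers: the index values below stand in for your unique x values, and the column name 'y' is my own choice for illustration, not something taken from your code.

```python
import pandas as pd

# Hypothetical index values standing in for the unique x values.
xs = [832, 0, 840]

# An empty (all-NaN) frame with one object column to hold the y values.
result = pd.DataFrame(index=xs, columns=['y'], dtype=object)

# Label-based assignment through the public .loc indexer.
result.loc[832, 'y'] = '6231'
result.loc[0, 'y'] = '00000000'
```

Rows that were never assigned (840 here) simply stay missing, so there is no need to pre-fill the frame.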

Also, this line

result.fillna(NaN)

is a no-op because by default fillna returns a copy. If you want to update result in place, do

result.fillna(NaN, inplace=True)

This API convention is fairly consistent throughout pandas. That is, for methods where it makes sense to do so, the function signatures have something like

object.method(..., inplace=False)

by default.
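A small sketch of the copy-versus-in-place behaviour, using a made-up one-column frame (the names frame, filled and still_missing are only for illustration):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'y': [1.0, np.nan, 3.0]})

# Without inplace=True the original is untouched; a new object is returned.
filled = frame.fillna(0.0)
still_missing = frame['y'].isnull().sum()  # the original still has one NaN

# Rebinding the name is the other common way to keep the result.
frame = frame.fillna(0.0)
```

Rebinding the name, as in the last line, is usually just as convenient as passing inplace=True.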

Part 2

As for your second question, it looks like you want to check whether all duplicate xs have the same y value. One way to do that is:

df.groupby('x').filter(lambda x: x.count() > 1).groupby('x').y.nunique() == 1

This says:

1. groupby the 'x' column
2. give me subsets where there's more than a single row in the group (repeated values in 'x')
3. groupby our new de-single-fied 'x' column
4. tell me whether there's exactly one unique 'y' for each value in 'x'

If 4. is False for any of the groups, that means you have x values repeated, where the y values are different.

Here's an example of this in action (I've modified your original dataset a little bit):

In [94]: df = pd.read_csv(StringIO('''x,y
832,"6231"
1,"00000000"
1,"00000001"
0,"00000000"
0,"00000000"
0,"00000000"
0,"00000000"
840,"6214"
840,"6111"'''))

In [95]: df.groupby('x').filter(lambda x: x.count() > 1).groupby('x').y.nunique() == 1
Out[95]:
x
0       True
1      False
840    False
dtype: bool
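To go from the check to the crosswalk itself, here is one possible sketch. It assumes a pandas version that provides nunique on grouped columns and DataFrame.drop_duplicates; the variable names (n_unique, conflicts, crosswalk) and the shortened sample data are mine, not part of your code.

```python
from io import StringIO

import pandas as pd

csv_text = '''x,y
832,"6231"
1,"00000000"
1,"00000001"
0,"00000000"
0,"00000000"
840,"6214"
840,"6111"'''

# Reading y as str keeps the leading zeros ("00000000" stays a string).
df = pd.read_csv(StringIO(csv_text), dtype={'y': str})

# Number of distinct y values seen for each x.
n_unique = df.groupby('x')['y'].nunique()

# x values that map to more than one y, i.e. the inconsistent entries.
conflicts = n_unique[n_unique > 1].index.tolist()

# Crosswalk restricted to the consistent x values: one row per x.
consistent = df[~df['x'].isin(conflicts)]
crosswalk = consistent.drop_duplicates('x').set_index('x')['y']
```

With this sample, conflicts holds 1 and 840, and crosswalk maps 832 and 0 to their single y values. Whether to drop the inconsistent x values or to raise an error on them is a choice that depends on your data.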
