ihopethiswillfi

Reputation: 115

Can't modify Pandas DataFrame while iterating

My question is about the for loop below, a pattern I see prominent data scientists use on Kaggle. However, it doesn't seem to work for me.

Python 3.66. Pandas 0.23.4

Setup

import pandas as pd

train = pd.DataFrame({'id': [2, 3, 1], 'time':['2017-04-17 22:23:22', '2018-05-22 14:20:00', '2017-01-09 08:02:14']})
test = pd.DataFrame({'id': [2, 3, 1], 'time':['2017-04-17 22:23:22', '2018-05-22 14:20:00', '2017-01-09 08:02:14']})
train

>>>         id  time  
>>>   0     2   2017-04-17 22:23:22
>>>   1     3   2018-05-22 14:20:00
>>>   2     1   2017-01-09 08:02:14

Sort it (this works)

train.sort_values('time', ascending=True)

>>>     id  time
>>> 2   1   2017-01-09 08:02:14
>>> 0   2   2017-04-17 22:23:22
>>> 1   3   2018-05-22 14:20:00

Sort it in a FOR loop - why does this not work?

for data in [train, test]:
    data = data.sort_values('time', ascending=True)
train

>>>     id  time
>>> 0   2   2017-04-17 22:23:22
>>> 1   3   2018-05-22 14:20:00
>>> 2   1   2017-01-09 08:02:14

Upvotes: 2

Views: 1210

Answers (2)

jpp

Reputation: 164783

Sort it in a FOR loop - why does this not work?

Because the assignment inside the loop only rebinds the loop variable data to the new, sorted DataFrame that sort_values returns. It never touches the names train or test, or the list [train, test], so both original DataFrames stay unsorted.
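To make the rebinding visible, here is a minimal sketch that reuses the question's data. The helper names original, rebound and ids_after_loop are only for illustration and are not part of the question.

```python
import pandas as pd

times = ['2017-04-17 22:23:22', '2018-05-22 14:20:00', '2017-01-09 08:02:14']
train = pd.DataFrame({'id': [2, 3, 1], 'time': times})
test = pd.DataFrame({'id': [2, 3, 1], 'time': times})

for data in [train, test]:
    original = data                      # same object as train (or test)
    data = data.sort_values('time')      # rebinds the name `data` to a new DataFrame
    rebound = data is not original       # the sorted frame is a different object

# train was never reassigned or mutated, so it keeps its original row order
ids_after_loop = train['id'].tolist()
print(rebound, ids_after_loop)
```

The sorted result exists only under the name data and is discarded on the next iteration, which is why train prints unchanged.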

Instead, you can use sequence unpacking:

train, test = (df.sort_values('time') for df in (train, test))

Or, use enumerate in a for loop:

data = [train, test]
for idx, df in enumerate(data):
    data[idx] = df.sort_values('time')

Then refer to your dataframes via index, i.e. data[0], data[1].

Or, use a dictionary and iterate items:

data = {'train': train, 'test': test}

for key, df in data.items():
    data[key] = df.sort_values('time')

Then refer to your dataframes via key, i.e. data['train'], data['test'].

Upvotes: 4

Suresh Mali

Reputation: 348

There is a simple way: pass inplace=True to sort_values, so each DataFrame is sorted in place instead of a new one being returned.

>>> train = pd.DataFrame({'id': [2, 3, 1], 'time':['2017-04-17 22:23:22', '2018-05-22 14:20:00', '2017-01-09 08:02:14']})
>>> test = pd.DataFrame({'id': [2, 3, 1], 'time':['2017-04-17 22:23:22', '2018-05-22 14:20:00', '2017-01-09 08:02:14']})
>>> for data in [train, test]:
...     data.sort_values('time', ascending=True, inplace=True)
...

>>> test
   id                 time
2   1  2017-01-09 08:02:14
0   2  2017-04-17 22:23:22
1   3  2018-05-22 14:20:00
>>> train
   id                 time
2   1  2017-01-09 08:02:14
0   2  2017-04-17 22:23:22
1   3  2018-05-22 14:20:00

If you also want the index renumbered after sorting, add reset_index():

data.sort_values('time', ascending=True, inplace=True)
data.reset_index(inplace=True, drop=True)
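Put together inside the loop, the two calls might look like the sketch below. It assumes the same train and test frames as in the question and that you do not need to keep the old index (drop=True discards it).

```python
import pandas as pd

times = ['2017-04-17 22:23:22', '2018-05-22 14:20:00', '2017-01-09 08:02:14']
train = pd.DataFrame({'id': [2, 3, 1], 'time': times})
test = pd.DataFrame({'id': [2, 3, 1], 'time': times})

for data in [train, test]:
    # Both calls mutate the object that `data` refers to, which is the
    # same object as train (or test), so the change is visible outside the loop.
    data.sort_values('time', ascending=True, inplace=True)
    data.reset_index(inplace=True, drop=True)

print(train)
```

This works because no name is rebound: the loop variable still points at the original DataFrame while it is modified.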

Upvotes: 3
