Reputation: 115
My question is about the for loop below. I see this pattern used by prominent data scientists on Kaggle, but it doesn't seem to work for me.
Python 3.6.6, pandas 0.23.4
train = pd.DataFrame({'id': [2, 3, 1], 'time':['2017-04-17 22:23:22', '2018-05-22 14:20:00', '2017-01-09 08:02:14']})
test = pd.DataFrame({'id': [2, 3, 1], 'time':['2017-04-17 22:23:22', '2018-05-22 14:20:00', '2017-01-09 08:02:14']})
train
>>> id time
>>> 0 2 2017-04-17 22:23:22
>>> 1 3 2018-05-22 14:20:00
>>> 2 1 2017-01-09 08:02:14
train.sort_values('time', ascending=True)
>>> id time
>>> 2 1 2017-01-09 08:02:14
>>> 0 2 2017-04-17 22:23:22
>>> 1 3 2018-05-22 14:20:00
for data in [train, test]:
    data = data.sort_values('time', ascending=True)
train
>>> id time
>>> 0 2 2017-04-17 22:23:22
>>> 1 3 2018-05-22 14:20:00
>>> 2 1 2017-01-09 08:02:14
Upvotes: 2
Views: 1210
Reputation: 164783
Sort it in a FOR loop - why does this not work?
Because your for loop doesn't bind your newly defined variable data to the objects within [train, test]. Each iteration rebinds the name data to a new, sorted DataFrame, while train and test still refer to the original, unsorted objects.
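The difference between rebinding a name and mutating an object can be seen without pandas. Here is a minimal sketch using plain lists; the names a, b, item, and rebound_a are made up for illustration:

```python
a = [3, 1, 2]
b = [9, 7, 8]

# Rebinding: `item` is pointed at a brand-new sorted list each iteration.
# The lists that `a` and `b` refer to are never touched.
for item in [a, b]:
    item = sorted(item)

rebound_a = list(a)  # snapshot of `a` after the first loop

# Mutation: list.sort() changes the object itself, so every name that
# refers to it (including `a` and `b`) sees the sorted result.
for item in [a, b]:
    item.sort()

print(rebound_a)  # [3, 1, 2]
print(a, b)       # [1, 2, 3] [7, 8, 9]
```

Without inplace=True, DataFrame.sort_values behaves like sorted() here: it returns a new object and leaves the original alone.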
Instead, you can use sequence unpacking:
train, test = (df.sort_values('time') for df in (train, test))
Or, use enumerate in a for loop:
data = [train, test]
for idx, df in enumerate(data):
    data[idx] = df.sort_values('time')
Then refer to your dataframes via index, i.e. data[0], data[1].
Or, use a dictionary and iterate over its items:
data = {'train': train, 'test': test}
for key, df in data.items():
    data[key] = df.sort_values('time')
Then refer to your dataframes via key, i.e. data['train'], data['test'].
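All of these variants share one idea: the sorted result has to be stored somewhere you can reach afterwards. As a rough sketch, the dictionary version could be wrapped in a small helper. The name sort_frames is hypothetical, not part of pandas:

```python
import pandas as pd


def sort_frames(frames, column):
    """Return a new dict with each DataFrame sorted by `column`."""
    # sort_values returns a new DataFrame; the inputs are left unchanged.
    return {name: df.sort_values(column) for name, df in frames.items()}


train = pd.DataFrame({'id': [2, 3, 1],
                      'time': ['2017-04-17 22:23:22',
                               '2018-05-22 14:20:00',
                               '2017-01-09 08:02:14']})
test = train.copy()

data = sort_frames({'train': train, 'test': test}, 'time')

print(data['train']['id'].tolist())  # [1, 2, 3]
print(train['id'].tolist())          # [2, 3, 1] (original is untouched)
```

If you want the original names to point at the sorted frames, assign them back explicitly, e.g. train = data['train'].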
Upvotes: 4
Reputation: 348
There is a simple way: just pass inplace=True when sorting.
>>> train = pd.DataFrame({'id': [2, 3, 1], 'time':['2017-04-17 22:23:22', '2018-05-22 14:20:00', '2017-01-09 08:02:14']})
>>> test = pd.DataFrame({'id': [2, 3, 1], 'time':['2017-04-17 22:23:22', '2018-05-22 14:20:00', '2017-01-09 08:02:14']})
>>> for data in [train, test]:
...     data.sort_values('time', ascending=True, inplace=True)
...
>>> test
id time
2 1 2017-01-09 08:02:14
0 2 2017-04-17 22:23:22
1 3 2018-05-22 14:20:00
>>> train
id time
2 1 2017-01-09 08:02:14
0 2 2017-04-17 22:23:22
1 3 2018-05-22 14:20:00
If you also want the index renumbered after sorting, add reset_index() inside the loop:
data.sort_values('time', ascending=True, inplace=True)
data.reset_index(inplace=True, drop=True)
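Put together as a complete script, the in-place approach might look like the sketch below. It reuses the data from the question, and test is built with train.copy() purely for brevity:

```python
import pandas as pd

train = pd.DataFrame({'id': [2, 3, 1],
                      'time': ['2017-04-17 22:23:22',
                               '2018-05-22 14:20:00',
                               '2017-01-09 08:02:14']})
test = train.copy()

for data in [train, test]:
    # Both calls modify the DataFrame object itself, so no assignment
    # back to `train` or `test` is needed.
    data.sort_values('time', ascending=True, inplace=True)
    data.reset_index(drop=True, inplace=True)

print(train['id'].tolist())   # [1, 2, 3]
print(train.index.tolist())   # [0, 1, 2]
```

This works because the loop never rebinds data: it only calls methods that change the object that train or test already refers to.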
Upvotes: 3