Vic Nicethemer

Reputation: 1111

Pandas shuffle column values doesn't work

I have a CSV file with two columns: "Context" and "Utterance".

I need to shuffle (randomly reorder) the values in the "Context" column. Note that I don't want to shuffle whole rows, only that one column; the order of the second column, "Utterance", should stay the same.

For this I used the answers from "shuffling/permutating a DataFrame in pandas":

  train_df2 = pd.read_csv("./data/nolabel.csv", encoding='utf-8', sep=",")
  train_df2.drop('Utterance', axis=1, inplace=True) # delete 'Utterance'
  train_df2 = train_df2.sample(frac=1) # shuffle
  train_df2['Utterance'] = train_moscow_df['Utterance'] # add back 'Utterance'
  train_df2["Label"] = 0
  header = ["Context", "Utterance", "Label"] # columns to write, in this order

  train_df2.to_csv('./data/label0.csv', columns = header, encoding='utf-8', index = False)

But the result is wrong: whole rows get shuffled, and the corresponding values in the two columns are still paired as before.

I need the first value of the first column to be paired with a random value from the second. (I also tried from sklearn.utils import shuffle, with no luck either.)

Upvotes: 3

Views: 7635

Answers (1)

EdChum

Reputation: 394061

The problem is that when the df is shuffled, the index is shuffled along with the rows. When you then add the original column back, it aligns on the original index, so every value ends up next to its original partner again. You can call reset_index after shuffling so that it doesn't do this:

train_df2 = train_df2.sample(frac=1) # shuffle
train_df2.reset_index(inplace=True, drop=True)
train_df2['Utterance'] = train_moscow_df['Utterance'] # add back 'Utterance'

Example:

In [196]:
# setup
df = pd.DataFrame(np.random.randn(5,2), columns=list('ab'))
df

Out[196]:
          a         b
0  0.116596 -0.684748
1 -0.133922 -0.969933
2  0.103551  0.912101
3 -0.279751 -0.348443
4  1.453413  0.062378

Now we drop and shuffle as before; note the index values:

In [197]:
a = df.drop('b', axis=1)
a = a.sample(frac=1)
a

Out[197]:
          a
3 -0.279751
0  0.116596
1 -0.133922
4  1.453413
2  0.103551

Now reset the index:

In [198]:
a.reset_index(inplace=True, drop=True)
a

Out[198]:
          a
0 -0.279751
1  0.116596
2 -0.133922
3  1.453413
4  0.103551

Now we can add the column back while retaining the shuffled order:

In [199]:
a['b'] = df['b']
a

Out[199]:
          a         b
0 -0.279751 -0.684748
1  0.116596 -0.969933
2 -0.133922  0.912101
3  1.453413 -0.348443
4  0.103551  0.062378
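
As a side note, here is a minimal sketch of an alternative that is not part of the code above: instead of resetting the index, assign the shuffled values as a plain NumPy array, which skips index alignment altogether. The small frame and the fixed random_state are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Toy stand-in for the CSV in the question (values are illustrative only).
df = pd.DataFrame({
    "Context": ["c0", "c1", "c2", "c3", "c4"],
    "Utterance": ["u0", "u1", "u2", "u3", "u4"],
})

# Shuffle only the Context column; random_state is fixed just for repeatability.
shuffled = df["Context"].sample(frac=1, random_state=0)

# Assigning the shuffled Series directly aligns on index labels,
# so every value goes back to its original row and the shuffle is undone.
aligned = df.copy()
aligned["Context"] = shuffled

# Assigning the underlying array is positional, so the shuffled order is kept
# while Utterance stays exactly where it was.
result = df.copy()
result["Context"] = shuffled.to_numpy()

print(aligned)
print(result)
```

This should give the same pairing as the reset_index approach; which one to use is mostly a matter of taste.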

Upvotes: 4
