Shuffle pandas column while avoiding a condition

Question

I have a dataframe in which each row holds two similar sentences. A third column, Relationship, contains a string code that indicates how the two texts are related. In these codes:
P stands for Plant, V for Vegetables and F for Fruits. Also,
A stands for Animal, I for Insects and M for Mammals.

import pandas as pd

data = {'Text1': ["All Vegetables are Plants",
                   "Cows are happy",
                   "Butterflies are really beautiful",
                   "I enjoy Mangoes",
                   "Vegetables are green"],
        'Text2': ['Some Plants are good Vegetables',
                  'Cows are enjoying',
                  'Beautiful butterflies are delightful to watch',
                  'Mango pleases me',
                  'Spinach is green'],
        'Relationship': ['PV123', 'AM4355', 'AI784', 'PF897', 'PV776']}

df = pd.DataFrame(data)

print(df)

>>>

	Text1	Text2	Relationship
0	All Vegetables are Plants	Some Plants are good Vegetables	PV123
1	Cows are happy	Cows are enjoying	AM4355
2	Butterflies are really beautiful	Beautiful butterflies are delightful to watch	AI784
3	I enjoy Mangoes	Mango pleases me	PF897
4	Vegetables are green	Spinach is green	PV776

I want to train a BERT model on this data. However, I also need to create examples of dissimilar sentences. My plan is to give a label of 1 to the dataset as it is, then shuffle Text2 and give the shuffled rows a label of 0. The problem is that a purely random shuffle does not produce good dissimilar examples, because it ignores the "Relationship" column and can pair two texts from the same category.

How can I shuffle my data so that texts like "All Vegetables are Plants" and "Spinach is green" do not end up on the same row in Text1 and Text2 respectively?

trazoM · Accepted Answer

I resolved this as follows:

1. I created a new column with the first 2 letters of the Relationship column.
2. I used this new column to create a multi-index. A groupby on this new column should work here as well.
3. For each group, I populated Text2 using texts from the other groups.
4. I concatenated all the modified groups back together.

With this, I was able to create semantically dissimilar pairs.
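The steps above could be sketched roughly as follows. This is a minimal illustration using groupby rather than a multi-index, not the exact code behind the answer. The Group and Label column names, the fixed random_state, and the decision to sample with replacement when the other groups are too small are all assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "Text1": ["All Vegetables are Plants",
              "Cows are happy",
              "Butterflies are really beautiful",
              "I enjoy Mangoes",
              "Vegetables are green"],
    "Text2": ["Some Plants are good Vegetables",
              "Cows are enjoying",
              "Beautiful butterflies are delightful to watch",
              "Mango pleases me",
              "Spinach is green"],
    "Relationship": ["PV123", "AM4355", "AI784", "PF897", "PV776"],
})

# The first two letters of Relationship identify the group
# ("Group" is an illustrative column name).
df["Group"] = df["Relationship"].str[:2]

negative_parts = []
for group, rows in df.groupby("Group"):
    # Candidate Text2 values come only from rows outside the current group.
    # Note: this raises if the data contains a single group (no candidates).
    others = df.loc[df["Group"] != group, "Text2"]
    sampled = others.sample(
        n=len(rows),
        replace=len(rows) > len(others),  # reuse texts only if unavoidable
        random_state=0,                   # fixed seed, for reproducibility only
    )
    # to_numpy() drops the sampled index so values are assigned by position.
    negative_parts.append(rows.assign(Text2=sampled.to_numpy(), Label=0))

# Concatenate the modified groups, then add the original pairs as positives.
negatives = pd.concat(negative_parts, ignore_index=True)
positives = df.assign(Label=1)
train = pd.concat([positives, negatives], ignore_index=True)

print(train[["Text1", "Text2", "Label"]])
```

Grouping on two letters still allows pairs such as PV with PF, which share the Plant category. If that is too close, grouping on `df["Relationship"].str[0]` instead would keep all Plant texts apart from each other.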
