trazoM

Reputation: 324

Shuffle pandas column while avoiding a condition

I have a dataframe of sentence pairs that are similar to each other. It also has a third column, Relationship, containing strings that describe how the two texts are related. For instance:
P for Plant, V for Vegetables and F for Fruits. Also,
A for Animal, I for Insects and M for Mammals.

import pandas as pd

data = {'Text1': ["All Vegetables are Plants",
                   "Cows are happy",
                   "Butterflies are really beautiful",
                   "I enjoy Mangoes",
                   "Vegetables are green"],
        'Text2': ['Some Plants are good Vegetables',
                  'Cows are enjoying',
                  'Beautiful butterflies are delightful to watch',
                  'Mango pleases me',
                  'Spinach is green'],
        'Relationship': ['PV123', 'AM4355', 'AI784', 'PF897', 'PV776']}

df = pd.DataFrame(data)

print(df)

>>>
   Text1                             Text2                                          Relationship
0  All Vegetables are Plants         Some Plants are good Vegetables                PV123
1  Cows are happy                    Cows are enjoying                              AM4355
2  Butterflies are really beautiful  Beautiful butterflies are delightful to watch  AI784
3  I enjoy Mangoes                   Mango pleases me                               PF897
4  Vegetables are green              Spinach is green                               PV776

I want to train a BERT model on this data. However, I also need to create examples of dissimilar sentences. My plan is to give the dataset as it is a label of 1, then shuffle Text2 and give the shuffled copy a label of 0. The problem is that random shuffling alone does not produce good dissimilar examples unless it also takes the "Relationship" column into account.
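For reference, the naive approach described above could be sketched roughly as follows. The `label` column name and the fixed `random_state` are my own choices for illustration. Nothing here looks at "Relationship", so a shuffled row can still pair two related texts.

```python
import pandas as pd

df = pd.DataFrame({
    'Text1': ["All Vegetables are Plants", "Cows are happy",
              "Butterflies are really beautiful", "I enjoy Mangoes",
              "Vegetables are green"],
    'Text2': ['Some Plants are good Vegetables', 'Cows are enjoying',
              'Beautiful butterflies are delightful to watch',
              'Mango pleases me', 'Spinach is green'],
    'Relationship': ['PV123', 'AM4355', 'AI784', 'PF897', 'PV776'],
})

# Original pairs are the similar examples (label 1)
positives = df[['Text1', 'Text2']].assign(label=1)

# Shuffle Text2 only and call the result dissimilar (label 0).
# to_numpy() drops the index so the shuffled order is actually kept.
negatives = df[['Text1', 'Text2']].copy()
negatives['Text2'] = negatives['Text2'].sample(frac=1, random_state=0).to_numpy()
negatives['label'] = 0

train = pd.concat([positives, negatives], ignore_index=True)
```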

How can I shuffle my data so that texts like "All Vegetables are Plants" and "Spinach is green" do not end up on the same row as Text1 and Text2 respectively?

Upvotes: 0

Views: 76

Answers (2)

trazoM

Reputation: 324

I resolved this by:

  1. Creating a new column with the first 2 letters from the relationship column.
  2. Using this new column to create a multi-index. A groupby on this new column should work here as well.
  3. For each group, populating Text2 with texts from other groups.
  4. Concatenating all the modified groups back together.

With this, I was able to create pairs that really are semantically dissimilar.
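The steps above can be sketched with a groupby, as a rough illustration rather than the exact code I ran. The `Group` and `label` column names, the fixed `random_state`, and the fallback to sampling with replacement are assumptions made for this example. Note that grouping on the first 2 letters treats PV and PF as different groups even though they share P.

```python
import pandas as pd

df = pd.DataFrame({
    'Text1': ["All Vegetables are Plants", "Cows are happy",
              "Butterflies are really beautiful", "I enjoy Mangoes",
              "Vegetables are green"],
    'Text2': ['Some Plants are good Vegetables', 'Cows are enjoying',
              'Beautiful butterflies are delightful to watch',
              'Mango pleases me', 'Spinach is green'],
    'Relationship': ['PV123', 'AM4355', 'AI784', 'PF897', 'PV776'],
})

# Step 1: group key = first 2 letters of the relationship code
df['Group'] = df['Relationship'].str[:2]

parts = []
# Step 2: iterate over the groups
for key, group in df.groupby('Group'):
    # Step 3: candidate Text2 values come only from rows in other groups
    others = df.loc[df['Group'] != key, 'Text2']
    sampled = others.sample(n=len(group),
                            replace=len(group) > len(others),  # only if too few candidates
                            random_state=0)
    neg = group[['Text1']].copy()
    neg['Text2'] = sampled.to_numpy()
    neg['label'] = 0
    parts.append(neg)

# Step 4: concatenate the modified groups back together
negatives = pd.concat(parts).sort_index()

# Combine with the original (similar) pairs for training
positives = df[['Text1', 'Text2']].assign(label=1)
train = pd.concat([positives, negatives], ignore_index=True)
```

Every row in `negatives` pairs a Text1 with a Text2 taken from a different two-letter group, so "All Vegetables are Plants" can never land next to "Spinach is green".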

Upvotes: 0

Rawson

Reputation: 2822

There may be a more efficient method, but this works. The idea is to stack both text columns into a single column of text, take 2 random samples of it and concatenate them side by side. Pairs whose relationships match (letters only) are dropped, and the intersection of the letters of the two relationship strings becomes the new relationship. Note that this new value will not capture every relationship, as it may miss characteristics that did not already match in the initial dataframe.

# 1 column of all text, rather than two
df1 = pd.concat([df[["Text1", "Relationship"]].rename(columns={"Text1": "Text"}),
           df[["Text2", "Relationship"]].rename(columns={"Text2": "Text"})],
          ignore_index=True, axis=0)

# Get letters only for relationships
df1.Relationship = df1["Relationship"].str.extract('^([a-zA-Z]+)', expand=False)

# take 2 random samples (each a full shuffle of df1) and concatenate side by side
out = pd.concat([df1.sample(len(df1)).reset_index(drop=True),
                 df1.sample(len(df1)).reset_index(drop=True)],
                axis=1, ignore_index=True)
# filter for not equal characteristics only
out = out.loc[out[1].ne(out[3])]

# get similar characteristics (based on intersection of letters)
out["Relationship"] = out.apply(lambda row: "".join(list(set(row[1]) & set(row[3]))), axis=1)

# required columns only
out = out[[0, 2, "Relationship"]].rename(columns={0: "Text1", 2: "Text2"})

Example output:

out.to_dict()
# Out[]:
# {'Text1': {0: 'Cows are enjoying',
#   2: 'Cows are happy',
#   3: 'Spinach is green',
#   4: 'Vegetables are green',
#   5: 'I enjoy Mangoes',
#   6: 'Beautiful butterflies are delightful to watch',
#   7: 'Mango pleases me'},
#  'Text2': {0: 'Vegetables are green',
#   2: 'I enjoy Mangoes',
#   3: 'Beautiful butterflies are delightful to watch',
#   4: 'Mango pleases me',
#   5: 'Cows are enjoying',
#   6: 'Some Plants are good Vegetables',
#   7: 'Cows are happy'},
#  'Relationship': {0: '', 2: '', 3: '', 4: 'P', 5: '', 6: '', 7: ''}}

Upvotes: 1
