Reputation: 324
I have a dataframe of sentence pairs in which the two sentences on each row are similar. A third column, Relationship, holds a string code that describes how the two texts are related. For instance:
P for Plant, V for Vegetables and F for Fruits. Also,
A for Animal, I for Insects and M for Mammals.
import pandas as pd

data = {'Text1': ["All Vegetables are Plants",
                  "Cows are happy",
                  "Butterflies are really beautiful",
                  "I enjoy Mangoes",
                  "Vegetables are green"],
        'Text2': ['Some Plants are good Vegetables',
                  'Cows are enjoying',
                  'Beautiful butterflies are delightful to watch',
                  'Mango pleases me',
                  'Spinach is green'],
        'Relationship': ['PV123', 'AM4355', 'AI784', 'PF897', 'PV776']}
df = pd.DataFrame(data)
print(df)
>>>
| | Text1 | Text2 | Relationship |
|---|---|---|---|
| 0 | All Vegetables are Plants | Some Plants are good Vegetables | PV123 |
| 1 | Cows are happy | Cows are enjoying | AM4355 |
| 2 | Butterflies are really beautiful | Beautiful butterflies are delightful to watch | AI784 |
| 3 | I enjoy Mangoes | Mango pleases me | PF897 |
| 4 | Vegetables are green | Spinach is green | PV776 |
I want to train a BERT model on this data. However, I also need to create examples of dissimilar sentences. My plan is to give a label of 1 to the dataset as it is, then shuffle Text2 and give the shuffled pairs a label of 0. The problem is that random shuffling alone does not produce good dissimilar examples unless it makes use of the "Relationship" column.
How can I shuffle my data so that texts like "All Vegetables are Plants" and "Spinach is green" do not end up on the same row in Text1 and Text2 respectively?
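For reference, the naive shuffle described above can be sketched roughly as follows. This is only an illustration under assumptions: the `label` and `prefix` column names and the fixed `random_state` are made up here and are not part of the original data.

```python
import pandas as pd

df = pd.DataFrame({
    "Text1": ["All Vegetables are Plants", "Cows are happy",
              "Butterflies are really beautiful", "I enjoy Mangoes",
              "Vegetables are green"],
    "Text2": ["Some Plants are good Vegetables", "Cows are enjoying",
              "Beautiful butterflies are delightful to watch",
              "Mango pleases me", "Spinach is green"],
    "Relationship": ["PV123", "AM4355", "AI784", "PF897", "PV776"],
})

# Letter part of the relationship code, e.g. "PV123" -> "PV".
prefix = df["Relationship"].str.extract(r"^([A-Za-z]+)", expand=False)

# Positive pairs: the rows as given, labelled 1.
pos = df[["Text1", "Text2"]].assign(label=1)

# Naive negatives: shuffle Text2 (carrying its prefix along) and label 0.
shuffled = (df.assign(prefix=prefix)[["Text2", "prefix"]]
              .sample(frac=1, random_state=0)
              .reset_index(drop=True))
neg = pd.DataFrame({"Text1": df["Text1"], "Text2": shuffled["Text2"], "label": 0})

naive = pd.concat([pos, neg], ignore_index=True)

# True where a supposed negative pair still shares the same category letters.
same_group = prefix.eq(shuffled["prefix"])
print(neg[same_group])
```

Any row printed at the end is a "negative" that is in fact still related, which is exactly the case I am trying to avoid.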
Upvotes: 0
Views: 76
Reputation: 324
I resolved this by:
With this, I was able to really create semantically dissimilar pairs.
Upvotes: 0
Reputation: 2822
There may be a more efficient method, but this will work. The logic is: stack both text columns into a single column, take two random samples of it, and concatenate them side by side. Rows whose relationships (letters only) match are dropped, and the intersection of the letters of the two relationship strings becomes the new relationship. Note that this won't capture every relationship, since it can miss shared characteristics that were not recorded in the initial dataframe.
# 1 column of all text, rather than two
df1 = pd.concat([df[["Text1", "Relationship"]].rename(columns={"Text1": "Text"}),
                 df[["Text2", "Relationship"]].rename(columns={"Text2": "Text"})],
                ignore_index=True, axis=0)
# Get letters only for relationships
df1["Relationship"] = df1["Relationship"].str.extract('^([a-zA-Z]+)', expand=False)
# take 2 random permutations of all rows and concatenate them side by side
# (columns become 0: text, 1: relationship, 2: text, 3: relationship)
out = pd.concat([df1.sample(frac=1).reset_index(drop=True),
                 df1.sample(frac=1).reset_index(drop=True)],
                axis=1, ignore_index=True)
# filter for not equal characteristics only
out = out.loc[out[1].ne(out[3])]
# get similar characteristics (based on intersection of letters, sorted for a stable order)
out["Relationship"] = out.apply(lambda row: "".join(sorted(set(row[1]) & set(row[3]))), axis=1)
# required columns only
out = out[[0, 2, "Relationship"]].rename(columns={0: "Text1", 2: "Text2"})
Example output:
out.to_dict()
# Out[]:
# {'Text1': {0: 'Cows are enjoying',
# 2: 'Cows are happy',
# 3: 'Spinach is green',
# 4: 'Vegetables are green',
# 5: 'I enjoy Mangoes',
# 6: 'Beautiful butterflies are delightful to watch',
# 7: 'Mango pleases me'},
# 'Text2': {0: 'Vegetables are green',
# 2: 'I enjoy Mangoes',
# 3: 'Beautiful butterflies are delightful to watch',
# 4: 'Mango pleases me',
# 5: 'Cows are enjoying',
# 6: 'Some Plants are good Vegetables',
# 7: 'Cows are happy'},
# 'Relationship': {0: '', 2: '', 3: '', 4: 'P', 5: '', 6: '', 7: ''}}
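If the end goal is a labelled training set, the same idea can be taken one step further. The following is only a sketch, not a definitive implementation: it rebuilds the `df` from the question, and the `group` and `label` column names are made up for illustration. It pairs every Text1 with every Text2, keeps only the pairs whose relationship letters differ, and samples as many of those as there are positive pairs.

```python
import pandas as pd

df = pd.DataFrame({
    "Text1": ["All Vegetables are Plants", "Cows are happy",
              "Butterflies are really beautiful", "I enjoy Mangoes",
              "Vegetables are green"],
    "Text2": ["Some Plants are good Vegetables", "Cows are enjoying",
              "Beautiful butterflies are delightful to watch",
              "Mango pleases me", "Spinach is green"],
    "Relationship": ["PV123", "AM4355", "AI784", "PF897", "PV776"],
})

# Letter part of the relationship code, e.g. "PV123" -> "PV".
df["group"] = df["Relationship"].str.extract(r"^([A-Za-z]+)", expand=False)

# Every Text1 paired with every Text2 (how="cross" needs pandas >= 1.2).
pairs = df[["Text1", "group"]].merge(df[["Text2", "group"]],
                                     how="cross", suffixes=("_1", "_2"))

# Candidate negatives: pairs that come from different groups.
candidates = pairs[pairs["group_1"] != pairs["group_2"]]

# Balanced classes: as many negatives (label 0) as positives (label 1).
neg = candidates.sample(n=len(df), random_state=0)[["Text1", "Text2"]].assign(label=0)
pos = df[["Text1", "Text2"]].assign(label=1)

# Combine and shuffle the rows before training.
train = (pd.concat([pos, neg], ignore_index=True)
           .sample(frac=1, random_state=0)
           .reset_index(drop=True))
print(train)
```

With this filter, PV and PF rows can still be paired as negatives because their letters differ. Comparing only the first letter (for example `df["Relationship"].str[0]`) would be a stricter choice that treats everything under P, or everything under A, as related.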
Upvotes: 1