Reputation: 191
I have a database of name records, and I'm trying to generate bigrams for each name and turn each bigram into a new row in the dataframe. I'm doing this because some records contain multiple names, and the same name can appear with its parts in a different order. My ultimate goal is to find duplicates and create a single record for each unique individual. I plan to use TF-IDF and cosine similarity on the results. Below is an example of what I'm trying to do.
Upvotes: 2
Views: 464
Reputation: 9619
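import pandas as pd
# Pair each word with the next one, keeping the record's ID alongside each bigram.
bigrams = [[rec_id, ' '.join(b)] for rec_id, name in zip(df['ID'], df['Name']) for b in zip(name.split(" ")[:-1], name.split(" ")[1:])]
bigrams_df = pd.DataFrame(bigrams, columns=['ID', 'Name'])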
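To make the pairing step easier to follow, here is a minimal pure-Python sketch of the same idea. The helper name `name_bigrams` and the sample records are hypothetical and only for illustration, not the asker's real data.

```python
def name_bigrams(name):
    """Return the adjacent-word pairs of a name as space-joined strings."""
    tokens = name.split()
    # zip pairs each token with its successor and stops at the shorter input,
    # so a single-word name yields no bigrams at all.
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


# Hypothetical (ID, name) records, assumed for this example only.
records = [(1, "John Doe"), (2, "John Doe Mike Smith"), (3, "Cher")]

# One output row per bigram, each carrying the ID of its source record.
rows = [(rec_id, bigram) for rec_id, name in records for bigram in name_bigrams(name)]
```

Note that a record with a single-word name contributes no rows, which may or may not be what you want for deduplication.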
Upvotes: 1
Reputation: 13349
Try using zip, apply and explode:
df.Name = df.Name.str.split()
df.Name.apply(lambda x: tuple(zip(x,x[1:]))).explode().map(lambda x: f"{x[0]} {x[1]}")
Or, using a list comprehension:
df2 = pd.Series([ f"{a} {b}" for val in df.Name for (a,b) in (zip(val,val[1:]))])
0 John Doe
1 John Doe
1 Doe Mike
1 Mike Smith
2 John Doe
2 Doe Mike
2 Mike Smith
2 Smith Steve
2 Steve Johnson
3 Smith Mike
3 Mike J.
3 J. Doe
3 Doe Johnson
3 Johnson Steve
4 Steve J.
4 J. M
4 M Smith
Name: Name, dtype: object
Edit, for the second part:
df2 = pd.DataFrame([ [idx+1, f"{a} {b}"] for idx,val in enumerate(df.Name) for (a,b) in (zip(val,val[1:]))], columns=['ID', 'Names'])
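The enumerate version above assumes the IDs are simply 1, 2, 3, ... in row order. If the frame already has its own ID column, a possible variant is to explode a column of bigram lists so the existing IDs are carried along. This is only a sketch: the helper `to_bigrams`, the sample IDs and the sample names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample frame; the ID values are made up for this example.
df = pd.DataFrame({
    "ID": [101, 205, 307],
    "Name": ["John Doe", "John Doe Mike Smith", "Cher"],
})


def to_bigrams(name):
    """Split a name on whitespace and join each adjacent pair of words."""
    tokens = name.split()
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


bigrams_df = (
    df.assign(Name=df["Name"].map(to_bigrams))  # each cell becomes a list of bigrams
      .explode("Name")                          # one row per bigram, ID repeated
      .dropna(subset=["Name"])                  # empty lists explode to NaN; drop them
      .reset_index(drop=True)
)
```

Dropping the NaN rows discards single-word names entirely; keep them instead if those records still need to be matched.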
Upvotes: 3