lucas winter

Reputation: 191

How to split a string in a pandas dataframe into bigrams that can then be exploded into new rows?

I have a database of name records, and I'm trying to split each name into bigrams and turn each bigram into a new row in the dataframe. I'm doing this because some records contain multiple names, and the same name can appear with its parts in a different order. My ultimate goal is to find duplicates and create a single record for each unique individual. I plan to run TF-IDF and cosine similarity on the resulting bigrams. Below is an example of what I'm trying to do.

Current: [image not reproduced]

Goal: [image not reproduced]

Upvotes: 2

Views: 464

Answers (2)

RJ Adriaansen

Reputation: 9619

import pandas as pd
# Pair each word with the next one; every bigram keeps the ID of its source row.
bigrams = [[row_id, ' '.join(b)] for row_id, name in zip(df['ID'], df['Name']) for b in zip(name.split(" ")[:-1], name.split(" ")[1:])]
bigrams_df = pd.DataFrame(bigrams, columns=['ID', 'Name'])
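
For anyone who wants to run this end to end, here is a minimal sketch of the same zip-based idea written as a loop. The sample rows and the helper name `bigram_rows` are assumptions made for illustration (the question's data was only posted as an image), and it assumes every `Name` is a non-missing string.

```python
import pandas as pd

# Hypothetical sample data; the real data in the question was posted as an image.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Name": ["John Doe", "John Doe Mike Smith", "Steve J. M Smith"],
})


def bigram_rows(frame):
    """Return one row per adjacent word pair, repeating the source row's ID."""
    rows = []
    for row_id, name in zip(frame["ID"], frame["Name"]):
        words = name.split()
        # zip(words, words[1:]) pairs each word with its successor.
        for first, second in zip(words, words[1:]):
            rows.append([row_id, f"{first} {second}"])
    return pd.DataFrame(rows, columns=["ID", "Name"])


bigrams_df = bigram_rows(df)
print(bigrams_df)
```

A name with a single word produces no bigrams here, so that ID simply does not appear in the result.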

Upvotes: 1

Pygirl

Reputation: 13349

Try using zip, apply, and explode:

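# Split each name into a list of words (this overwrites the Name column).
df.Name = df.Name.str.split()

# Zip each word with its successor, explode to one bigram per row, then join each pair.
df.Name.apply(lambda x: tuple(zip(x, x[1:]))).explode().map(lambda x: f"{x[0]} {x[1]}")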

Or, using a list comprehension (df.Name must already hold the word lists from the split above):

df2 = pd.Series([f"{a} {b}" for val in df.Name for a, b in zip(val, val[1:])])

Output of the apply/explode version (the index repeats the source row's label):

0         John Doe
1         John Doe
1         Doe Mike
1       Mike Smith
2         John Doe
2         Doe Mike
2       Mike Smith
2      Smith Steve
2    Steve Johnson
3       Smith Mike
3          Mike J.
3           J. Doe
3      Doe Johnson
3    Johnson Steve
4         Steve J.
4             J. M
4          M Smith
Name: Name, dtype: object
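
If you also want the ID carried along and want single-word names handled, something like the following may work. It is only a sketch: the sample rows and the helper name `explode_bigrams` are made up for illustration, and it assumes there are no missing names. An empty list explodes to NaN, which is why the `dropna` step is there.

```python
import pandas as pd

# Hypothetical sample data, including a single-word name that yields no bigrams.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Name": ["John Doe", "John Doe Mike Smith", "Cher"],
})


def explode_bigrams(frame, id_col="ID", name_col="Name"):
    """Explode each name into one row per bigram, keeping the ID column."""
    words = frame[name_col].str.split()
    # Build a list of "first second" strings for every row.
    pairs = words.apply(lambda w: [" ".join(p) for p in zip(w, w[1:])])
    out = frame[[id_col]].assign(**{name_col: pairs}).explode(name_col)
    # Rows whose list was empty become NaN after explode; drop them.
    return out.dropna(subset=[name_col]).reset_index(drop=True)


result = explode_bigrams(df)
print(result)
```

Building a copy instead of assigning back to `df.Name` leaves the original names intact, which may be useful when the full name is still needed later.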

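Edit, for the second part (keeping an ID next to each bigram; this assumes the IDs are 1, 2, 3, ... in row order and that df.Name already holds the word lists):

df2 = pd.DataFrame([[idx + 1, f"{a} {b}"] for idx, val in enumerate(df.Name) for a, b in zip(val, val[1:])], columns=['ID', 'Names'])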

Upvotes: 3
