How to split a string in a pandas DataFrame into bigrams that can then be exploded into new rows?

Question

I have a database of name records loaded into a DataFrame. I'm trying to create bigrams from each name and turn each bigram into a new row. I'm doing this because some records contain multiple names, and the same name can appear with its parts in a different order. My ultimate goal is to find duplicates and create a single record for each unique individual. I plan to use TF-IDF and cosine similarity on the results. Below is an example of what I'm trying to do.

Current:

Goal:

RJ Adriaansen · Accepted Answer

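import pandas as pd
# Pair each word with the word that follows it in the same name, keeping the row's ID with every bigram.
bigrams = [[row_id, ' '.join(pair)] for row_id, name in zip(df['ID'].tolist(), df['Name'].tolist()) for pair in zip(name.split(" ")[:-1], name.split(" ")[1:])]
bigrams_df = pd.DataFrame(bigrams, columns=['ID', 'Name'])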
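
Since the title asks about exploding, here is a minimal sketch of an explode-based variant, assuming a DataFrame with `ID` and `Name` columns as in the answer. The sample rows and the helper name `to_bigrams` are made up for illustration, because the question's example tables are not available. It also uses `split()` without an argument, which differs slightly from the answer by collapsing repeated whitespace.

```python
import pandas as pd

# Hypothetical sample data (assumption): not taken from the original question.
df = pd.DataFrame({
    "ID": [1, 2],
    "Name": ["John Michael Smith", "Jane Doe"],
})

def to_bigrams(name):
    """Return adjacent word pairs from `name`, each joined by a single space."""
    words = name.split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

# Replace each name with its list of bigrams, then explode so that
# every bigram lands on its own row alongside the original ID.
bigrams_df = (
    df.assign(Name=df["Name"].map(to_bigrams))
      .explode("Name")
      .dropna(subset=["Name"])  # single-word names give an empty list, which explode turns into NaN
      .reset_index(drop=True)
)

print(bigrams_df)
```

Which version to use is mostly a matter of taste: the list comprehension builds the rows directly, while `explode` keeps any other columns of `df` attached to each bigram without extra work.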
