Is there an easy way to split a large string in a Pandas DataFrame into chunks with an equal number of words?

Question

I have a dataset of 1000 rows, each containing an author and a large corpus of text belonging to that author. What I am ultimately trying to achieve is to explode each text row into multiple rows that all contain the same number of words, as in:

Author    text
Jack      "This is a sentence that contains eight words"
John      "This is also a sentence containing eight words"

So if I wanted to do it for 4-word chunks it would be:

Author    text
Jack      "This is a sentence"
Jack      "that contains eight words"
John      "This is also a"
John      "sentence containing eight words"

I can already do it by number of characters using textwrap, but ideally I want to do it by number of words. Any help would be highly appreciated. Thanks!

Reuven Chacha · Accepted Answer

Assuming you're using pandas >= 0.25 (which supports df.explode), you could use the following method:

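def split_by_equal_number_of_words(df, num_of_words, separator=" "):
    """
    Split each author's text into chunks of num_of_words words.

      1. Split each text entry into a list of words on 'separator'
      2. Explode to one row per word
      3. Number the words within each author, group every num_of_words
         consecutive words, and join them back with 'separator'

    :param df: DataFrame with 'author' and 'text' columns
    :param num_of_words: words per chunk (an author's last chunk may be shorter)
    :param separator: string to split on and to rejoin with
    :return: DataFrame with 'author' and 'text' columns, one row per chunk
    """
    # assign() returns a copy, so the caller's DataFrame is not modified
    df = df.assign(text=df["text"].str.split(separator))
    df = df.explode("text").reset_index(drop=True)
    # The chunk number restarts for each author, so a chunk never spans two authors
    chunk = df.groupby("author").cumcount() // num_of_words
    df = df.groupby(["author", chunk], sort=False)["text"].agg(separator.join)
    return df.reset_index(level=1, drop=True).reset_index()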
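
For comparison, here is a minimal, self-contained sketch of the same idea applied to the example from the question. It chunks each text in plain Python first and only then explodes, so every chunk stays inside its own row. The helper name chunk_words and the sample DataFrame are assumptions made for illustration; they are not part of the question or of the answer above.

```python
import pandas as pd


def chunk_words(text, n):
    """Split text on whitespace and rejoin it into chunks of at most n words."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(0, len(words), n)]


# Sample data taken from the question (column names are assumed)
df = pd.DataFrame({
    "author": ["Jack", "John"],
    "text": [
        "This is a sentence that contains eight words",
        "This is also a sentence containing eight words",
    ],
})

# Build one list of chunks per row, then expand to one row per chunk.
# DataFrame.explode requires pandas >= 0.25.
result = (
    df.assign(text=df["text"].apply(lambda t: chunk_words(t, 4)))
      .explode("text")
      .reset_index(drop=True)
)
print(result)
```

With 4-word chunks this gives two rows for Jack and two for John, matching the desired output in the question. If a text's word count is not a multiple of n, the last chunk of that row is simply shorter.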
