Alex S

Reputation: 91

Get a list of key words from a string

I have a dataframe, and one column contains the string description of movies in Danish:

df.Description.tail()

24756    Der er nye kendisser i rundkredsen, nemlig Ski...
24757    Hvad får man, hvis man blander en gruppe af k...
24758    Hvordan vælter man en minister? Hvordan ødel...
24759    Der er dømt mandehygge i hulen hos ZULUs tera...
24760    Kender du de dage på arbejdet, hvor alt bare ...

I first check that all the values of the column Description are strings: df.applymap(type).eq(str).all()

Video.ID.v26    False
Title            True
Category        False
Description      True
dtype: bool

What I want is to create another column that contains the words found in each string as a list, like this:

24756   [Der, er, nye, kendisser, i, rundkredsen, ...

In my loop, I also use Rake() to delete Danish stop words. Here is my loop:

# initializing the new column
df['Key_words'] = ""

for index, row in df.iterrows():
    plot = row['Description']

    # instantiating Rake; by default it uses English stopwords from NLTK, but we want Danish
    # and to discard all punctuation characters
    r = Rake('da')

    # extracting the words by passing the text
    r.extract_keywords_from_text(plot)

    # getting the dictionary with key words and their scores
    key_words_dict_scores = r.get_word_degrees()

    # assigning the key words to the new column
    row['Key_words'] = list(key_words_dict_scores.keys())

The problem is that the new column Key_words is empty...

df.Key_words.tail()

24756    
24757    
24758    
24759    
24760    
Name: Key_words, dtype: object

Any help appreciated.

Upvotes: 3

Views: 160

Answers (2)

Ali Faizan

Reputation: 502

Use apply:

def my_keyword_func(row):
    plot = row['Description']
    ....
    return ['key word 1', 'key word 2']
df['Key_words'] = df.apply(my_keyword_func, axis=1)
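A minimal runnable sketch of the same idea, applied to the Description column only. The helper extract_keywords and the tiny DANISH_STOP_WORDS set are hypothetical stand-ins for the Rake step in the question, not part of any library; swap in your own extraction logic.

```python
import re
import pandas as pd

# Assumption: a tiny hand-picked stop word set, standing in for the Rake step.
DANISH_STOP_WORDS = {"der", "er", "i", "en", "af", "man", "hvis"}

def extract_keywords(text):
    """Lowercase the text, keep runs of letters, drop stop words and duplicates."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    keywords = []
    for word in words:
        if word not in DANISH_STOP_WORDS and word not in keywords:
            keywords.append(word)
    return keywords

# Dummy data shaped like the question's Description column
df = pd.DataFrame({"Description": [
    "Der er nye kendisser i rundkredsen",
    "Hvad får man, hvis man blander en gruppe",
]})

# apply returns a new Series, which is then assigned as a column
df["Key_words"] = df["Description"].apply(extract_keywords)
print(df["Key_words"].tolist())
```

Because apply builds a new Series from the return values, nothing is written into a temporary row object, so the result actually ends up in the dataframe.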

Upvotes: 0

warped

Reputation: 9482

From the documentation of df.iterrows:

You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.

In your case, this combination of lines is the problem:

for index, row in df.iterrows():  # row is generated
    [...]
    row['Key_words'] = list(key_words_dict_scores.keys()) # row is modified
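The effect can be reproduced on a small dummy dataframe. This is only an illustrative sketch with made-up column names; it assumes columns of mixed dtypes, as in the question, in which case each row yielded by iterrows is a separate Series.

```python
import pandas as pd

# Dummy dataframe with mixed dtypes (integers and strings)
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
df["c"] = ""  # initialise the new column, as in the question

for index, row in df.iterrows():
    # row is a separate Series object, so this write never reaches df
    row["c"] = "changed"

# The column in the original dataframe is still empty
print(df["c"].tolist())
```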

If you want to use iterrows, you could circumvent situations like the above, for instance by storing intermediate data in a list, like so:

import pandas as pd

# make dummy dataframe
df = pd.DataFrame({'a':range(5)})

# initialise list
new_entries = []

# do iterrows, and operations on entries in row
for ix, row in df.iterrows():
    new_entries.append(2 * row['a'])  # store intermediate data in list

df['b'] = new_entries # assign temp data to new column
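Transferred to the question's setting, the same pattern might look like the sketch below. The function simple_keywords and its stop word set are hypothetical placeholders for the Rake calls; the point is only that the lists are collected outside the row and assigned once the loop has finished.

```python
import pandas as pd

# Assumption: placeholder stop words; replace with the real Rake-based extraction.
STOP_WORDS = {"der", "er", "man", "en"}

def simple_keywords(text):
    """Split on whitespace and drop stop words (stand-in for Rake)."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

df = pd.DataFrame({"Description": [
    "Der er nye kendisser",
    "Hvordan vælter man en minister",
]})

keyword_lists = []
for index, row in df.iterrows():
    # read from row, but never write to it
    keyword_lists.append(simple_keywords(row["Description"]))

# assign all results at once, aligned on the dataframe's index
df["Key_words"] = pd.Series(keyword_lists, index=df.index)
print(df["Key_words"].tolist())
```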

One other piece of advice: I had to generate my own dataframe to illustrate my solution, because the format in which you posted your data does not allow easy import or copying. Please check out this post to learn how to ask better-formulated questions.

Upvotes: 1
