Reputation: 91
I have a dataframe, and one column contains the string description of movies in Danish:
df.Description.tail()
24756 Der er nye kendisser i rundkredsen, nemlig Ski...
24757 Hvad får man, hvis man blander en gruppe af k...
24758 Hvordan vælter man en minister? Hvordan ødel...
24759 Der er dømt mandehygge i hulen hos ZULUs tera...
24760 Kender du de dage på arbejdet, hvor alt bare ...
I first check that all the values of the column Description are strings:
df.applymap(type).eq(str).all()
Video.ID.v26 False
Title True
Category False
Description True
dtype: bool
What I want is to create another column that contains the words found in each string as a list, like this:
24756 [Der, er, nye, kendisser, i, rundkredsen, ...
In my loop, I also use Rake() to delete Danish stop words. Here is my loop:
# initializing the new column
df['Key_words'] = ""
for index, row in df.iterrows():
    plot = row['Description']
    # instantiating Rake; by default it uses English stop words from NLTK, but we want Danish,
    # and it discards all punctuation characters
    r = Rake('da')
    # extracting the words by passing the text
    r.extract_keywords_from_text(plot)
    # getting the dictionary with key words and their scores
    key_words_dict_scores = r.get_word_degrees()
    # assigning the key words to the new column
    row['Key_words'] = list(key_words_dict_scores.keys())
The problem is that the new column Key_words is empty:
df.Key_words.tail()
24756
24757
24758
24759
24760
Name: Key_words, dtype: object
Any help appreciated.
Upvotes: 3
Views: 160
Reputation: 502
Use apply with a function that returns the keyword list for each row:
def my_keyword_func(row):
    plot = row['Description']
    ....
    return ['key word 1', 'key word 2']

df['Key_words'] = df.apply(my_keyword_func, axis=1)
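To make the pattern concrete, here is a minimal runnable sketch. It does not use Rake: extract_keywords and DANISH_STOPWORDS are hypothetical stand-ins for the keyword step (a plain tokenizer plus a tiny hand-written stop-word set), so swap in your own Rake call there. Since only one column is needed, the sketch applies the function to df["Description"] directly instead of using axis=1.

```python
import re
import pandas as pd

# Tiny stand-in stop-word set (an assumption for illustration,
# not the list Rake/NLTK would actually use).
DANISH_STOPWORDS = {"der", "er", "i", "en", "af", "man", "hvis"}

def extract_keywords(text):
    """Hypothetical stand-in for the Rake step: lowercase, tokenize, drop stop words."""
    words = re.findall(r"\w+", text.lower())
    return [w for w in words if w not in DANISH_STOPWORDS]

df = pd.DataFrame({"Description": [
    "Der er nye kendisser i rundkredsen",
    "Hvad får man, hvis man blander en gruppe",
]})

# apply returns a new Series, which is assigned to the DataFrame in one step,
# so nothing is written to a temporary row object.
df["Key_words"] = df["Description"].apply(extract_keywords)
print(df["Key_words"].tolist())
```

The value returned by the function becomes the cell value, so returning a list gives one list per row.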
Upvotes: 0
Reputation: 9482
From the documentation of df.iterrows:
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
In your case, this combination of lines is the problem:
for index, row in df.iterrows():  # row is generated
    [...]
    row['Key_words'] = list(key_words_dict_scores.keys())  # row is modified
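A small sketch of the effect, assuming a dummy dataframe with mixed column types (the column n and the value "changed" are made up for the illustration). The write inside the loop lands on the temporary row, and the dataframe keeps its initial empty strings:

```python
import pandas as pd

# Dummy dataframe with mixed dtypes (text and integers), so each row
# yielded by iterrows is a freshly built Series, not a view of df.
df = pd.DataFrame({"Description": ["en film", "en serie"], "n": [1, 2]})
df["Key_words"] = ""

for index, row in df.iterrows():
    row["Key_words"] = "changed"  # modifies only the temporary row object

# The column in df still holds the initial empty strings.
still_empty = bool((df["Key_words"] == "").all())
print(still_empty)
```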
If you want to use iterrows, you could circumvent situations like the above, for instance by storing intermediate data in a list, like so:
import pandas as pd

# make dummy dataframe
df = pd.DataFrame({'a': range(5)})

# initialise list
new_entries = []

# do iterrows, and operations on entries in row
for ix, row in df.iterrows():
    new_entries.append(2 * row['a'])  # store intermediate data in list

df['b'] = new_entries  # assign temp data to new column
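The same pattern carries over to a text column. The sketch below is only an illustration on made-up rows: str.split stands in for the Rake keyword step, which you would call at that point instead.

```python
import pandas as pd

df = pd.DataFrame({"Description": [
    "Der er nye kendisser",
    "Hvordan vælter man en minister",
]})

key_words = []
for ix, row in df.iterrows():
    # str.split is a placeholder for the real keyword extraction
    key_words.append(row["Description"].split())

# one list per row, assigned after the loop has finished
df["Key_words"] = key_words
print(df["Key_words"].tolist())
```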
One other piece of advice: I had to generate my own dataframe to illustrate my solution, because the format in which you posted your data does not allow easy import or copying. Please check out this post to learn how to ask better-formulated questions.
Upvotes: 1