Reputation: 91
I am trying to run a for loop on a long dataframe and count the number of English and non-English words in a given text (each text is a new row).
+-------+--------+----+
| Index | Text | ID |
+-------+--------+----+
| 1 | Text 1 | 1 |
| 2 | Text 2 | 2 |
| 3 | Text 3 | 3 |
+-------+--------+----+
This is my code
c = 0
for text in df_letters['Text_clean']:
    # Counters
    CTEXT = text
    c += 1
    eng_words = 0
    non_eng_words = 0
    text = " ".join(text.split())
    # For every word in text
    for word in text.split(' '):
        # Check if it is english
        if english_dict.check(word):
            eng_words += 1
        else:
            non_eng_words += 1
    # Print the result
    # NOTE that these results are discarded each new text
    df_letters.at[text, 'eng_words'] = eng_words
    df_letters.at[text, 'non_eng_words'] = non_eng_words
    df_letters.at[text, 'Input'] = CTEXT
    #print('Index: {}; EN: {}; NON-EN: {}'.format(c, eng_words, non_eng_words))
But instead of getting the same dataframe I used as input, with 3 new columns:
+-------+--------+----+---------+-------------+---------+
| Index | Text | ID | English | Non-English | Input |
+-------+--------+----+---------+-------------+---------+
| 1 | Text 1 | 1 | 1 | 0 | Text 1 |
| 2 | Text 2 | 2 | 1 | 0 | Text 2 |
| 3 | Text 3 | 3 | 0 | 1 | Text 3 |
+-------+--------+----+---------+-------------+---------+
the dataframe doubles in length, with a new row added for each text, like this:
+--------+--------+-----+---------+-------------+--------+
| Index | Text | ID | English | Non-English | Input |
+--------+--------+-----+---------+-------------+--------+
| 1 | Text 1 | 1 | nan | nan | nan |
| 2 | Text 2 | 2 | nan | nan | nan |
| 3 | Text 3 | 3 | nan | nan | nan |
| Text 1 | nan | nan | 1 | 0 | Text 1 |
| text 2 | nan | nan | 1 | 0 | Text 2 |
| Text 3 | nan | nan | 0 | 1 | Text 3 |
+--------+--------+-----+---------+-------------+--------+
What am I doing wrong here?
Upvotes: 0
Views: 41
Reputation: 1351
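DataFrame.at accesses the DataFrame by index label. The index of your DataFrame is [1, 2, 3], not [Text 1, Text 2, Text 3], so df_letters.at[text, ...] does not match any existing row, and assigning to a label that does not exist appends a new row with that label. I think the best solution for you is to replace your loop with one like this: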
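for index, text in df_letters['Text_clean'].items():
where index is the row's index label (1, 2, 3 in your example) and text is the value in that row. (Series.iteritems() was removed in pandas 2.0; items() is its replacement.) Then, inside the loop, you can do: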
df_letters.at[index, 'eng_words'] = eng_words
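Putting the pieces together, the loop might look roughly like the sketch below. The sample texts, the KNOWN_WORDS set and the is_english helper are hypothetical stand-ins for your data and for english_dict.check(), which is not defined in the question; swap in your real dictionary object.

```python
import pandas as pd

# Hypothetical stand-in for english_dict.check(); replace with your real dictionary.
KNOWN_WORDS = {"hello", "world", "the", "cat"}


def is_english(word):
    return word.lower() in KNOWN_WORDS


# Made-up sample data using the same index labels as in the question.
df_letters = pd.DataFrame(
    {"Text_clean": ["hello world", "bonjour  le monde", "the cat miaule"]},
    index=[1, 2, 3],
)

# Create the result columns up front so they keep an integer dtype.
df_letters["eng_words"] = 0
df_letters["non_eng_words"] = 0

for index, text in df_letters["Text_clean"].items():
    # split() with no argument also collapses repeated whitespace.
    words = text.split()
    eng_words = sum(1 for word in words if is_english(word))
    # Write to the existing row by its index label, so no new rows are created.
    df_letters.at[index, "eng_words"] = eng_words
    df_letters.at[index, "non_eng_words"] = len(words) - eng_words

print(df_letters)
```

Because every write uses an index label that already exists, the dataframe keeps its original three rows and only gains the new columns. The separate Input column is no longer needed, since each row still holds its own text.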
Upvotes: 1