Navot Naor

Reputation: 91

pandas for loop duplicating rows

I am trying to run a for loop on a long dataframe and count the number of English and non-English words in a given text (each text is a new row).

+-------+--------+----+
| Index |  Text  | ID |
+-------+--------+----+
|     1 | Text 1 |  1 |
|     2 | Text 2 |  2 |
|     3 | Text 3 |  3 |
+-------+--------+----+
     

This is my code

c = 0
for text in df_letters['Text_clean']:
    # Counters
    CTEXT= text
    c +=1
    eng_words = 0
    non_eng_words = 0
    text = " ".join(text.split())
    # For every word in text
    for word in text.split(' '):
      # Check if it is english
      if english_dict.check(word) == True:
        eng_words += 1
      else:
        non_eng_words += 1
    # Print the result
    # NOTE that these results are discarded each new text
    df_letters.at[text, 'eng_words'] = eng_words
    df_letters.at[text, 'non_eng_words'] = non_eng_words
    df_letters.at[text, 'Input'] = CTEXT
    #print('Index: {}; EN: {}; NON-EN: {}'.format(c, eng_words, non_eng_words))

But instead of getting the same dataframe I used as input, with 3 new columns:

+-------+--------+----+---------+-------------+---------+
| Index |  Text  | ID | English | Non-English |  Input  |
+-------+--------+----+---------+-------------+---------+
|     1 | Text 1 |  1 |       1 |           0 | Text 1  |
|     2 | Text 2 |  2 |       1 |           0 | Text 2  |
|     3 | Text 3 |  3 |       0 |           1 | Text 3  |
+-------+--------+----+---------+-------------+---------+

the dataframe doubles in length, with a new row added for each text, like this:

+--------+--------+-----+---------+-------------+--------+
| Index  |  Text  | ID  | English | Non-English | Input  |
+--------+--------+-----+---------+-------------+--------+
| 1      | Text 1 | 1   | nan     | nan         | nan    |
| 2      | Text 2 | 2   | nan     | nan         | nan    |
| 3      | Text 3 | 3   | nan     | nan         | nan    |
| Text 1 | nan    | nan | 1       | 0           | Text 1 |
| Text 2 | nan    | nan | 1       | 0           | Text 2 |
| Text 3 | nan    | nan | 0       | 1           | Text 3 |
+--------+--------+-----+---------+-------------+--------+

What am I doing wrong here?

Upvotes: 0

Views: 41

Answers (1)

Flavio Moraes

Reputation: 1351

DataFrame.at accesses a cell by its index label. The index of your DataFrame is [1, 2, 3], not [Text 1, Text 2, Text 3], so df_letters.at[text, ...] finds no matching row and adds a new one labelled with the text instead. I think the best solution is to replace your loop with one like this:

for index, text in df_letters['Text_clean'].items():

where index is the index label of the current row. Then you can do:

df_letters.at[index, 'eng_words'] = eng_words
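Putting it together, the whole loop might look roughly like the sketch below. It is only an illustration: the sample texts, the ENGLISH_WORDS set and the is_english helper are made up here as a stand-in for your english_dict.check, and the two counter columns are created before the loop so that .at only ever writes into existing cells.

```python
import pandas as pd

# Hypothetical stand-in for english_dict.check(word) from the question.
ENGLISH_WORDS = {"hello", "world", "good", "morning"}

def is_english(word):
    return word.lower() in ENGLISH_WORDS

# Made-up sample data with the same index labels as in the question.
df_letters = pd.DataFrame(
    {"Text_clean": ["hello world", "bonjour   world", "hola mundo"],
     "ID": [1, 2, 3]},
    index=[1, 2, 3],
)

# Create the result columns up front.
df_letters["eng_words"] = 0
df_letters["non_eng_words"] = 0
df_letters["Input"] = df_letters["Text_clean"]

for index, text in df_letters["Text_clean"].items():
    words = text.split()  # split() with no argument also collapses repeated spaces
    eng_words = sum(is_english(word) for word in words)
    # index is the row label (1, 2, 3), so existing rows are updated
    # and no new rows are added.
    df_letters.at[index, "eng_words"] = eng_words
    df_letters.at[index, "non_eng_words"] = len(words) - eng_words

print(df_letters)
```

With your real data you would keep your own df_letters and call english_dict.check(word) in place of is_english(word).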

Upvotes: 1
