Reputation: 5117
I do the following:
import pandas as pd
df_texts = pd.read_csv('data_texts.csv', keep_default_na=True)
for index, row in df_texts.iterrows():
    list_of_words = row['text'].split()
    df_texts.loc[index, '#_words'] = len(list_of_words)
    list_of_unique_words = set(list_of_words)
    df_texts.loc[index, '#_unique_words'] = len(list_of_unique_words)
The problem is that the numbers in the #_words and #_unique_words columns are stored as floats even though they are integers.
To clarify, these two columns do not pre-exist in the .csv that I read with pd.read_csv; I create them in the for loop.
How can I directly store them as integers?
Upvotes: 2
Views: 122
Reputation: 4689
If you create the column by assigning a value to a single row, all the other rows are implicitly initialized to NaN, which is a floating point value. This forces the entire column to float.
(You will also notice this if you try to convert the column using df_texts['#_words'] = df_texts['#_words'].astype(int) before all values have been set. It will fail because NaN cannot be converted to int.)
Therefore, the column can't become an integer column until all values are set. The problem goes away if you initialize the entire column with df_texts['#_words'] = 0 before the loop.
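As a minimal sketch of that fix, the loop from the question can be run after both columns have been pre-created with an integer value. The small DataFrame here is a hypothetical stand-in for data_texts.csv, which is not available:

```python
import pandas as pd

# Hypothetical stand-in for the contents of data_texts.csv
df_texts = pd.DataFrame({'text': ['a b c', 'a b a', 'c']})

# Pre-create both columns with an integer value so that no NaN
# placeholders are ever needed while the loop fills them in
df_texts['#_words'] = 0
df_texts['#_unique_words'] = 0

for index, row in df_texts.iterrows():
    list_of_words = row['text'].split()
    df_texts.loc[index, '#_words'] = len(list_of_words)
    df_texts.loc[index, '#_unique_words'] = len(set(list_of_words))

print(df_texts.dtypes)
```

Because every row already holds an integer, each assignment replaces an int with an int and the columns should keep their integer dtype.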
Edit: Also, as the other answers have pointed out, this assignment can be done without using a loop in the first place.
Upvotes: 0
Reputation: 6246
A better way to do this, and to get ints directly, is to assign each new column in one step and avoid iterating through the dataframe altogether.
With some dummy data for an example:
import pandas as pd
texts = ['word1 word2 word3', 'word1 word2 word1', 'word3']
df_texts = pd.DataFrame(texts, columns = ['text'])
                text
0  word1 word2 word3
1  word1 word2 word1
2              word3
Calculate the length for all rows using the text column separately and then assign.
temp = df_texts['text'].str.split()
df_texts['#_words'] = [len(row) for row in temp]  # build a list of per-row lengths and assign it as a column
df_texts['#_unique_words'] = [len(set(row)) for row in temp]
print(df_texts)
# Output:
                text  #_words  #_unique_words
0  word1 word2 word3        3               3
1  word1 word2 word1        3               2
2              word3        1               1
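The same idea can probably be written without the list comprehensions by using the pandas string accessor. This is only a sketch on the same dummy data; the variable name words is an illustrative choice, not part of the answer above:

```python
import pandas as pd

df_texts = pd.DataFrame(
    {'text': ['word1 word2 word3', 'word1 word2 word1', 'word3']}
)

# Split each text into a list of words (one list per row)
words = df_texts['text'].str.split()

# .str.len() also works on lists, giving the number of words per row
df_texts['#_words'] = words.str.len()

# Count distinct words per row by turning each list into a set
df_texts['#_unique_words'] = words.apply(lambda ws: len(set(ws)))

print(df_texts)
```

Note that if the text column contained missing values, .str.len() would return NaN for those rows and the column would fall back to float, so this assumes every row has text.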
Upvotes: 1
Reputation: 10030
You can apply the int function to the column you need (note that int truncates any fractional part):
import pandas as pd

df = pd.DataFrame({
    'n': [1.12, 1.2345, 5.234]
})
df['n'] = df['n'].apply(int)  # int() truncates toward zero
df
   n
0  1
1  1
2  5
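For a whole-column conversion, astype is a common alternative to apply. The sketch below uses made-up values and also shows the nullable 'Int64' dtype, which is one option if some rows are still missing; neither column name comes from the question:

```python
import pandas as pd

# Hypothetical float column whose values are all whole numbers
df = pd.DataFrame({'n': [3.0, 2.0, 1.0]})
df['n'] = df['n'].astype(int)  # works because there are no NaN values

# Hypothetical column with a missing value: plain int cannot hold NaN,
# but the nullable 'Int64' dtype stores it as <NA>
partial = pd.Series([3.0, None, 1.0], name='m')
partial = partial.astype('Int64')

print(df['n'].dtype, partial.dtype)
```

The capital "I" in 'Int64' matters: it selects the pandas nullable integer type rather than the NumPy int64 type.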
Upvotes: 0