Reputation: 5117

Store integers as integers and not as floats

I do the following:

import pandas as pd

df_texts = pd.read_csv('data_texts.csv', keep_default_na=True)

for index, row in df_texts.iterrows():
    list_of_words = row['text'].split()
    df_texts.loc[index, '#_words'] = len(list_of_words)
    list_of_unique_words = set(list_of_words)
    df_texts.loc[index, '#_unique_words'] = len(list_of_unique_words)

The problem is that the numbers in the #_words and #_unique_words columns are stored as floats even though they are integers.

To clarify, these two columns do not exist in the .csv file that I read with pd.read_csv; I create them in the for loop.

How can I directly store them as integers?

Upvotes: 2

Answers (3)

Christoph Burschka

Reputation: 4689

If you create the column by assigning a value to a single row, all the other rows are implicitly initialized to NaN, which is a floating point value. This forces the entire column to float.

(You will also notice this if you try to convert the column using df_texts['#_words'] = df_texts['#_words'].astype(int) before all values have been set. It will fail because NaN cannot be converted to int.)
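Both effects can be reproduced on a small frame. This is only a sketch: the three-row DataFrame is made up for illustration and stands in for the question's CSV, and the exact exception class depends on the pandas version (it is a ValueError or a subclass of it).

```python
import pandas as pd

# Hypothetical three-row frame standing in for the question's CSV.
df = pd.DataFrame({'text': ['a b c', 'a b a', 'c']})

# Assigning to one cell of a column that does not exist yet creates the
# column; the rows that were not assigned are filled with NaN.
df.loc[0, '#_words'] = len(df.loc[0, 'text'].split())

column_dtype = df['#_words'].dtype              # float, because NaN is a float
missing_rows = int(df['#_words'].isna().sum())  # rows still holding NaN

# Converting to int while NaN is still present raises an error.
try:
    df['#_words'].astype(int)
    conversion_failed = False
except ValueError:
    conversion_failed = True

print(column_dtype, missing_rows, conversion_failed)
```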

Therefore, the column can't become an integer column until all values are set. The problem goes away if you initialize the entire column with df_texts['#_words'] = 0 before the loop.
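A minimal sketch of that fix, keeping the loop from the question. The sample texts are made up so the snippet runs without data_texts.csv; with the real file you would keep the pd.read_csv call instead.

```python
import pandas as pd

# Made-up sample data in place of pd.read_csv('data_texts.csv').
df_texts = pd.DataFrame({'text': ['word1 word2 word3', 'word1 word2 word1', 'word3']})

# Create both columns up front with an integer value, so no NaN is introduced.
df_texts['#_words'] = 0
df_texts['#_unique_words'] = 0

for index, row in df_texts.iterrows():
    list_of_words = row['text'].split()
    df_texts.loc[index, '#_words'] = len(list_of_words)
    df_texts.loc[index, '#_unique_words'] = len(set(list_of_words))

# Both new columns should still report an integer dtype here.
print(df_texts.dtypes)
```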

Edit: Also, as the other answers have pointed out, this assignment can be done without using a loop in the first place.

Upvotes: 0

Paritosh Singh

Reputation: 6246

A better way to do this and directly get ints is to assign the new columns directly, and avoid iterating through the dataframe altogether.

With some dummy data for an example:

import pandas as pd
texts = ['word1 word2 word3', 'word1 word2 word1', 'word3']

df_texts = pd.DataFrame(texts, columns=['text'])

print(df_texts)
#Output:
                text
0  word1 word2 word3
1  word1 word2 word1
2              word3

Split the text column once, compute the counts for all rows from the result, and then assign them as new columns.

temp = df_texts['text'].str.split()
df_texts['#_words'] = [len(row) for row in temp]  # one word count per row, assigned as a new column
df_texts['#_unique_words'] = [len(set(row)) for row in temp]

print(df_texts)
#Output:
                text  #_words  #_unique_words
0  word1 word2 word3        3               3
1  word1 word2 word1        3               2
2              word3        1               1
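As a possible variation on the list comprehensions, the counts can also be taken with the .str accessor and Series.apply. This is a sketch that assumes every row has text; a missing value in the text column would make .str.len() return NaN for that row (turning the column back into floats) and would make the set() call fail.

```python
import pandas as pd

df_texts = pd.DataFrame({'text': ['word1 word2 word3', 'word1 word2 word1', 'word3']})

# Each element of `words` is a list of the words in that row.
words = df_texts['text'].str.split()

# .str.len() also works on lists and returns the number of items per row.
df_texts['#_words'] = words.str.len()

# A set drops duplicates, so its size is the number of distinct words.
df_texts['#_unique_words'] = words.apply(lambda w: len(set(w)))

print(df_texts)
```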

Upvotes: 1

vurmux

Reputation: 10030

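You can apply the int function to the column you want to convert (note that int drops the fractional part):

import pandas as pd

df = pd.DataFrame({
    'n': [1.12, 1.2345, 5.234]
})
df['n'] = df['n'].apply(int)  # int() truncates: 1.12 -> 1, 5.234 -> 5
df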
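The same conversion can probably be written without apply by using astype. A short sketch on the same toy column; it only works once the column has no NaN left, so for the question's columns it would have to run after the loop has filled every row.

```python
import pandas as pd

df = pd.DataFrame({'n': [1.12, 1.2345, 5.234]})

# astype(int) converts the whole column at once and, like int(), drops the
# fractional part. It raises an error if the column still contains NaN.
df['n'] = df['n'].astype(int)

print(df['n'].tolist())
```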

Upvotes: 0
