bpr

Reputation: 477

Failing to convert column in pandas dataframe to integer data type

I have this code, which manipulates a data set to create a new column by pulling info from an existing column. To match the data properly in a pd.merge with another data set, I would like to convert the 'Channel ID' column to integers. Despite the use of .astype(int), the column's dtype still shows up as float64 when I inspect the frame with .info().

import re
import pandas as pd

def cost(received_frame):
    received_frame.columns = ['Campaign', 'Ad Spend']
    campaigns = received_frame['Campaign']
    ID = []
    for c in campaigns:
        blocks = re.split('_', c)
        for block in blocks[1:]:
            if len(block) == 6 and block.isdigit(): 
                ID.append(block)
    ID = pd.Series(ID).str.replace("'","")
    ID = pd.DataFrame(ID)
    both = [ID,received_frame]
    frame = pd.concat(both,axis=1)
    frame.columns = ['Channel ID', 'Campaign', 'Ad Spend']
    frame['Channel ID'] = frame['Channel ID'].dropna().astype(int)
    return frame

Upvotes: 3

Views: 5431

Answers (2)

unutbu

Reputation: 879501

Suppose frame looks like this:

import numpy as np
import pandas as pd
frame = pd.DataFrame({'Channel ID':['1',np.nan,'2'], 'foo':['bar','baz',np.nan]})

  Channel ID  foo
0          1  bar
1        NaN  baz
2          2  NaN

You could drop rows from frame where Channel ID is NaN:

mask = pd.notnull(frame['Channel ID'])
frame = frame.loc[mask]

and then astype(int) will successfully convert the column to dtype int:

frame['Channel ID'] = frame['Channel ID'].astype(int)

yields

   Channel ID  foo
0           1  bar
2           2  NaN

As Ami Tavory explained, you can't drop the NaNs solely from frame['Channel ID'] with

frame['Channel ID'] = frame['Channel ID'].dropna()

because, upon assignment, pandas aligns the index of the right-hand side with the rows on the left-hand side. Rows on the left whose index labels do not appear on the right-hand side end up as NaN. So the NaNs remain in the bigger DataFrame, frame.

Since NaN is a float value, the dtype must remain a float dtype as long as the column contains NaNs.
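To see the alignment in action, here is a minimal sketch using the same example frame as above. The variable name converted is only for illustration; it is not from the question.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'Channel ID': ['1', np.nan, '2'],
                      'foo': ['bar', 'baz', np.nan]})

# dropna() returns a shorter Series: index label 1 is gone
converted = frame['Channel ID'].dropna().astype(int)
print(converted.index.tolist())   # [0, 2]

# Assigning back aligns on the index. Label 1 has no value on the
# right-hand side, so it becomes NaN and the column is upcast to float.
frame['Channel ID'] = converted
print(frame['Channel ID'].dtype)  # float64
```

The conversion to int does succeed on the shorter Series; it is the assignment back into the full-length frame that reintroduces the NaN.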

Upvotes: 3

Ami Tavory

Reputation: 76297

When you write

frame['Channel ID'].dropna().astype(int)

you get back a Series that may have fewer rows, since the NAs are dropped.

Then, when you assign it as

frame['Channel ID'] = frame['Channel ID'].dropna().astype(int)

the result is aligned with the frame's existing index, in a sort of merge. The rows you dropped come back as NaN, which is a float, so the whole column is converted to float64.

Instead of dropping the NAs, you should replace them with something else, depending on your problem (fillna, perhaps).
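As a rough sketch of what that could look like, here are two options on a small made-up frame. The sentinel value -1 is an assumption chosen for illustration, and the second option assumes a pandas version with the nullable 'Int64' dtype (0.24 or later). Whether either suits your merge depends on how you want missing IDs treated.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'Channel ID': ['1', np.nan, '2'],
                      'foo': ['bar', 'baz', np.nan]})

# Parse the ID strings into numbers first; missing entries stay NaN
numeric = pd.to_numeric(frame['Channel ID'])

# Option 1: fill missing IDs with a sentinel (-1 here is hypothetical),
# after which a plain integer dtype is possible
filled = numeric.fillna(-1).astype(int)

# Option 2: keep missing IDs as <NA> using the nullable integer dtype
nullable = numeric.astype('Int64')

print(filled.tolist())
print(nullable.dtype)
```

Both results keep every row of the frame, so assigning either one back to frame['Channel ID'] does not trigger the float upcast.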

Upvotes: 5
