ankitpandey
ankitpandey

Reputation: 349

Don't Process Null Values from CSV File using pandas

I have a file where multiple values are null, i need to change the values only where values are present and keep null values. Below is what i am trying but it is changing Null values (?) also. Please suggest what can be done.

Input File

A           B         C
XC123     CXW12     3.43
XC123       ?       11.44
CQ123     AB123     21.23
XC781       ?       44.22
SC568     AB123     2.12
SC568       ?       32.43
DC743     CXW12     324.78
XC123       ?       -6432.93

Expected Output File:

A       B         C
A0      B0      3.43
A0              11.44
A1      B1      21.23
A2              44.22
A3      B1      2.12
A3              32.43
A4      B0      324.78
A0              -6432.93

Code:

df=pd.read_csv('file.csv')

df_mask=pd.DataFrame({
            'A':['A{}'.format(i) for i in list(pd.factorize(df['A'])[0])],
            'B':['B{}'.format(i) for i in list(pd.factorize(df['B'])[0])],
            'C':df['C'].values.tolist(),
            })

df_mask.to_csv(finalOutput, sep=',', index=False)

Upvotes: 2

Views: 95

Answers (1)

jezrael
jezrael

Reputation: 862581

Idea is use na_values parameter in read_csv for convert ? to mising values, then factorize by default return -1 for them, so added if-else statement:

df=pd.read_csv('file.csv', na_values=['?'])

df_mask=pd.DataFrame({
            'A':['A{}'.format(i) if i != -1 else '' for i in (pd.factorize(df['A'])[0])],
            'B':['B{}'.format(i) if i != -1 else ''  for i in (pd.factorize(df['B'])[0])],
            'C':df['C'].values.tolist(),
            })

print (df_mask)
    A   B        C
0  A0  B0     3.43
1  A0        11.44
2  A1  B1    21.23
3  A2        44.22
4  A3  B1     2.12
5  A3        32.43
6  A4  B0   324.78
7  A0     -6432.93

Another idea is use numpy.where:

a = pd.Series(pd.factorize(df['A'])[0])
b = pd.Series(pd.factorize(df['B'])[0])
df_mask=pd.DataFrame({
            'A':np.where(a != -1, 'A' + a.astype(str), ''),
            'B':np.where(b != -1, 'B' + b.astype(str), ''),
            'C':df['C'].values,
            })

print (df_mask)
    A   B        C
0  A0  B0     3.43
1  A0        11.44
2  A1  B1    21.23
3  A2        44.22
4  A3  B1     2.12
5  A3        32.43
6  A4  B0   324.78
7  A0     -6432.93

Upvotes: 3

Related Questions