David Davis

Reputation: 47

Ignore rows of incorrect data type while doing pd.read_csv()

I have a 78 GB TSV file that I read columns from to export to smaller files. In the original file there is a column "QUANTITY" that contains mainly integer values, with the occasional random value such as "0.1.04.23.01.23.05.03.01.04.03.05.02.07.23..."

These random values mess with my ability to use the column; sometimes pandas just ignores them and lets me treat the column as a float, and other times it throws errors.

I would like a way to remove these values from the column as I load it.

Also, due to the file size, I load the file using 'chunksize' as follows:

import datetime
import pandas as pd

cols_keep = ['REPORTER_DEA_NO',
             'REPORTER_NAME',
             'BUYER_DEA_NO',
             'BUYER_NAME',
             'QUANTITY']

dd_wash = pd.read_csv('arcos_all_washpost.tsv', sep='\t',
                      usecols=cols_keep, chunksize=18000000)

i = 1
for chunk in dd_wash:
    print('chunk read')
    print(datetime.datetime.now())

    chunk.to_csv('RepBuyNames.csv', mode='a')
    print('end loop', i)
    print(datetime.datetime.now())
    i += 1

Upvotes: 2

Views: 903

Answers (1)

Joe

Reputation: 889

A value like 0.1.04.23.01.23.05.03.01.04.03.05.02.07.23 under your Quantity column will be read by pandas as a string.

This means your Quantity column will be of object dtype, and all the other values in it will be strings as well.

You can address this by tagging every value in the Quantity column for which the string method isdigit() returns False.

Since your file is large we want to avoid looping over rows, so use the numpy.where() method instead.

Take this sample:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'A':[1,2,3,4],'B':[5,6,7,'0.1.04.23.01.23.05.03.01.04.03.05.02.07.23']})
>>> df
   A                                           B
0  1                                           5
1  2                                           6
2  3                                           7
3  4  0.1.04.23.01.23.05.03.01.04.03.05.02.07.23

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
A    4 non-null int64
B    4 non-null object
dtypes: int64(1), object(1)
memory usage: 144.0+ bytes

>>> df['C'] = np.where(df['B'].str.isdigit(), ['Retain'], ['Delete'])
>>> df
   A                                           B       C
0  1                                           5  Retain
1  2                                           6  Retain
2  3                                           7  Retain
3  4  0.1.04.23.01.23.05.03.01.04.03.05.02.07.23  Delete

You can now decide how to proceed:

Remove the row if the value in column C == 'Delete'

>>> df = df[~df['C'].isin(['Delete'])]
>>> df
   A  B       C
0  1  5  Retain
1  2  6  Retain
2  3  7  Retain

Note that doing this removes the entire row. If you still need the information from the other columns in that row, you might prefer to just update the values tagged as rubbish instead.

Update the value of B if column C == 'Delete'

>>> df['B'] = np.where(df['C'] == 'Delete', [123456789], df['B'])
>>> df
   A          B       C
0  1          5  Retain
1  2          6  Retain
2  3          7  Retain
3  4  123456789  Delete

>>> df['B'] = df['B'].astype(int)
>>> df
   A          B       C
0  1          5  Retain
1  2          6  Retain
2  3          7  Retain
3  4  123456789  Delete

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
A    4 non-null int64
B    4 non-null int32
C    4 non-null object
dtypes: int32(1), int64(1), object(1)
memory usage: 160.0+ bytes

We have replaced the value 0.1.04.23.01.23.05.03.01.04.03.05.02.07.23 with the int 123456789, so column B can now be converted to an integer dtype.
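Since you are already reading the file in chunks, you could apply the same isdigit() tagging inside your existing loop before writing each chunk out. Here is a minimal sketch, reusing your file and column names and assuming QUANTITY is forced to be read as strings (dtype={'QUANTITY': str}); adjust to taste:

import datetime
import pandas as pd

cols_keep = ['REPORTER_DEA_NO', 'REPORTER_NAME', 'BUYER_DEA_NO',
             'BUYER_NAME', 'QUANTITY']

dd_wash = pd.read_csv('arcos_all_washpost.tsv', sep='\t',
                      usecols=cols_keep, dtype={'QUANTITY': str},
                      chunksize=18000000)

for i, chunk in enumerate(dd_wash, start=1):
    print('chunk read', datetime.datetime.now())

    # Keep only rows whose QUANTITY is a plain run of digits;
    # missing values are treated as bad and dropped as well.
    mask = chunk['QUANTITY'].str.isdigit().fillna(False)

    clean = chunk[mask].copy()
    clean['QUANTITY'] = clean['QUANTITY'].astype(int)

    # Write the header only for the first chunk since we append.
    clean.to_csv('RepBuyNames.csv', mode='a', header=(i == 1), index=False)
    print('end loop', i, datetime.datetime.now())

Keep in mind that isdigit() only accepts unsigned whole numbers, so decimal or negative quantities, if they can legitimately occur in your data, would also be dropped by this check.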

Hope this helps :))

Upvotes: 2
