Reputation: 47
I have a 78 GB TSV file that I read columns from to export to smaller files. In the original file, there is a column "QUANTITY" that contains mainly integer values, with an occasional random value such as "0.1.04.23.01.23.05.03.01.04.03.05.02.07.23..."
These random values mess with my ability to use the column: sometimes pandas just ignores them and lets me treat the column as a float, and other times it throws errors.
I would like a way to remove these values from the column as I load the column.
Also, due to the file size, I load the file using 'chunksize' as follows:
import datetime
import pandas as pd

cols_keep = ['REPORTER_DEA_NO',
             'REPORTER_NAME',
             'BUYER_DEA_NO',
             'BUYER_NAME',
             'QUANTITY']
dd_wash = pd.read_csv('arcos_all_washpost.tsv', sep='\t',
                      usecols=cols_keep, chunksize=18000000)
i = 1
for chunk in dd_wash:
    print('chunk read')
    print(datetime.datetime.now())
    chunk.to_csv('RepBuyNames.csv', mode='a')
    print('end loop', i)
    print(datetime.datetime.now())
    i += 1
Upvotes: 2
Views: 903
Reputation: 889
A value like 0.1.04.23.01.23.05.03.01.04.03.05.02.07.23 under your QUANTITY column will be read by pandas as a string, which means the column comes in as object dtype and the other values are treated as strings too.
You have to address this by tagging every value in the QUANTITY column that would return False for the string method isdigit(). Since your file is large we want to avoid looping, so use the numpy.where() method.
Take this sample:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'A':[1,2,3,4],'B':[5,6,7,'0.1.04.23.01.23.05.03.01.04.03.05.02.07.23']})
>>> df
A B
0 1 5
1 2 6
2 3 7
3 4 0.1.04.23.01.23.05.03.01.04.03.05.02.07.23
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
A 4 non-null int64
B 4 non-null object
dtypes: int64(1), object(1)
memory usage: 144.0+ bytes
>>> df['C'] = np.where(df['B'].str.isdigit(), ['Retain'], ['Delete'])
>>> df
A B C
0 1 5 Retain
1 2 6 Retain
2 3 7 Retain
3 4 0.1.04.23.01.23.05.03.01.04.03.05.02.07.23 Delete
You can now decide how to proceed:
Remove row if C column value == 'Delete'
>>> df = df[~df['C'].isin(['Delete'])]
>>> df
A B C
0 1 5 Retain
1 2 6 Retain
2 3 7 Retain
Note that doing this removes the entire row; if you need the information from the other columns in that row, maybe just update the values tagged as rubbish instead.
Update the value of B if column C == 'Delete'
>>> df['B'] = np.where(df['C'] == 'Delete', [123456789], df['B'])
>>> df
A B C
0 1 5 Retain
1 2 6 Retain
2 3 7 Retain
3 4 123456789 Delete
>>> df['B'] = df['B'].astype(int)
>>> df
A B C
0 1 5 Retain
1 2 6 Retain
2 3 7 Retain
3 4 123456789 Delete
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
A 4 non-null int64
B 4 non-null int32
C 4 non-null object
dtypes: int32(1), int64(1), object(1)
memory usage: 160.0+ bytes
We have updated the value from 0.1.04.23.01.23.05.03.01.04.03.05.02.07.23 into the int 123456789.
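If you want to apply this while you read your file in chunks, here is a minimal sketch of how the same isdigit() tagging could be folded into your loop (this assumes QUANTITY should only ever hold whole numbers, so blanks and anything non-digit are dropped before the chunk is appended; the file and column names are taken from your question):

import pandas as pd

cols_keep = ['REPORTER_DEA_NO', 'REPORTER_NAME',
             'BUYER_DEA_NO', 'BUYER_NAME', 'QUANTITY']

# Read QUANTITY as text so .str.isdigit() works on every chunk
reader = pd.read_csv('arcos_all_washpost.tsv', sep='\t',
                     usecols=cols_keep, dtype={'QUANTITY': str},
                     chunksize=18000000)

for i, chunk in enumerate(reader, start=1):
    # True only for values made entirely of digits; NaN/blank values become False
    mask = chunk['QUANTITY'].str.isdigit().fillna(False)
    clean = chunk[mask].copy()
    clean['QUANTITY'] = clean['QUANTITY'].astype(int)  # column is now safely numeric
    # Append to the output file, writing the header only for the first chunk
    clean.to_csv('RepBuyNames.csv', mode='a', header=(i == 1), index=False)

Each chunk is cleaned in memory before it is appended, so the 78 GB file never has to be held all at once.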
Hope this helps :))
Upvotes: 2