Reputation: 147
I have a CSV file that I need to read and parse as a Pandas DataFrame. In theory, all the columns follow a known schema of numerical and string data, but I know that some records are broken, either with fewer fields than expected or with the fields in the wrong order.
What I would like to do is get rid of all these problematic rows.
As a reference, in PySpark I used the 'DROPMALFORMED' mode to filter out records that don't match the schema:
from pyspark.sql.types import StructType, StructField, LongType, StringType

dataSchema = StructType([
    StructField("col1", LongType(), True),
    StructField("col2", StringType(), True)])
dataFrame = sqlContext.read \
    .format('com.databricks.spark.csv') \
    .options(header='false', delimiter='\t', mode='DROPMALFORMED') \
    .load(filename, schema=dataSchema)
With Pandas, I cannot find a simple way to do so. For instance, I thought this snippet would do the trick, but astype with errors='ignore' just returns the original value when the conversion fails, so the wrong value is kept instead of being dropped:
dataFrame['col1'] = dataFrame['col1'].astype(np.int64, errors='ignore')
Upvotes: 0
Views: 1194
Reputation: 343
Maybe pandas.to_numeric will help. It has an errors='coerce' option, which replaces all invalid values with NaN. Then you can use dropna() to remove the rows containing NaN:
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 'F', 8]], columns=['col1', 'col2', 'col3'])
df['col2'] = pd.to_numeric(df['col2'], errors='coerce')
df.dropna(inplace=True)
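For the tab-delimited file in the question, a minimal sketch of the same idea could look like the following. The file name 'data.tsv' is hypothetical, the column names come from the question's schema, and on_bad_lines='skip' assumes pandas >= 1.3 (older versions use error_bad_lines=False) to drop rows with the wrong number of fields:

import pandas as pd

# Hypothetical file name; columns named after the question's schema.
# on_bad_lines='skip' drops rows that don't have the expected number of fields.
df = pd.read_csv('data.tsv',
                 sep='\t',
                 header=None,
                 names=['col1', 'col2'],
                 on_bad_lines='skip')

# Coerce col1 to numeric; values that aren't valid numbers become NaN...
df['col1'] = pd.to_numeric(df['col1'], errors='coerce')

# ...and those rows are then dropped, similar to Spark's DROPMALFORMED for this column.
df = df.dropna(subset=['col1'])
df['col1'] = df['col1'].astype('int64')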
Upvotes: 1