Reputation: 11
i'm trying to do some date cleaning on some huge datasets, and I'm new to python (I've used google to search for my problem), so please bear over me with my terminology.
The data is imported from a CSV into a pandas.core.frame.DataFrame Some of my columns should only contain numbers and others only text:
CPRNUM REQ_SAMPLETIME SAMPLE_ID RESULT
0 1234567890 2014-05-30 07:59 50226686 0.7409090909090907
1 The sample was.. 2013-09-04 07:45 47721186 0.8290909090909093
2 1234567890 The sample was.. 46473016 1.0918181818181818
I would really like to get rid of the rows within column REQ_SAMPLETIME and CPRNUM which is not 10 digits long, and contain text, so it would look like:
CPRNUM REQ_SAMPLETIME SAMPLE_ID RESULT
0 1234567890 2014-05-30 07:59 50226686 0.7409090909090907
3 0987654321 2018-06-10 05:32 12354678 3.7290909090909093
4 1234567890 2013-09-04 07:45 15672687 5.9999951818181818
Thanks for the help
Thanks to Danish Bansal, I used your code, as it fits my problem the most:
My final code looks like this:
hba1c = pd.read_csv(r"C:\sample.csv", encoding = 'unicode_escape', engine ='python', sep = ';')
#This function checks if the number is valid
def isCPRNUMvalid(val):
if len(val) == 10: #if string has length of 10
if val.isnumeric(): #if string is pure number
return True
return False
hba1c['validCRPNUM'] = hba1c['CPRNUM'].apply(isCPRNUMvalid)
hba1c = hba1c[hba1c.validCRPNUM]
hba1c['Dates'] = pd.to_datetime(hba1c['REQ_SAMPLETIME']).dt.date
C:\Users\tphni\.conda\envs\py37\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
"""Entry point for launching an IPython kernel.
hba1c.head()
I get the error message as typed above in the code, but it still seems to work, and the output is as excepted, with the correct number of fewer rows than the original file.
Upvotes: 1
Views: 325
Reputation: 5281
I would suggest casting your columns to the desired format (e.g. numeric;datetime
) and removing rows with problems:
v0 = df.CPRNUM.astype(str).str.len() == 10
v1 = pd.to_numeric(df.CPRNUM, errors="coerce").notna()
v2 = pd.to_datetime(df.REQ_SAMPLETIME, errors="coerce").notna()
filtered = df[v0 * v1 * v2]
print(filtered)
Explanation:
pd.to_<type>
casts the input to the given typeerrors="coerce"
turns the problematic elements to NAs.notna()
gives a boolean vector with true where the value is not NA*
multiplying the boolean vectors gives us the vector corresponding to the and
operatorData
import pandas as pd
# Sample data
records = [
dict(CPRNUM="1234567890", REQ_SAMPLETIME="2014-05-30 07:59", SAMPLE_ID=50226686, RESULT=0.7409090909090907),
dict(CPRNUM="The sample was..", REQ_SAMPLETIME="2014-05-30 07:59", SAMPLE_ID=50226686, RESULT=0.7409090909090907),
dict(CPRNUM="0987654321", REQ_SAMPLETIME="The sample was..", SAMPLE_ID=50226686, RESULT=0.7409090909090907),
dict(CPRNUM="123", REQ_SAMPLETIME="The sample was..", SAMPLE_ID=50226686, RESULT=0.7409090909090907)
]
df = pd.DataFrame.from_records(records)
print(df)
# CPRNUM REQ_SAMPLETIME SAMPLE_ID RESULT
# 0 1234567890 2014-05-30 07:59 50226686 0.740909
# 1 The sample was.. 2014-05-30 07:59 50226686 0.740909
# 2 0987654321 The sample was.. 50226686 0.740909
# 3 123 The sample was.. 50226686 0.740909
Upvotes: 1
Reputation: 700
Another way to solve this is as below
import pandas as pd
import re
df = pd.read_csv('data.csv')
#This function checks if the number is valid
def isCPRNUMvalid(val):
if len(val) == 10: #if string has length of 10
if val.isnumeric(): #if string is pure number
return True
return False
#This functoin checks if datetime is valid
#You can modify this to get date-time from string if present
def isREQSAMLETIMEvalid(st):
r = r'[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}' #regex to match
if re.match(r, st):
return True
return False
df['validCRPNUM'] = df['CPRNUM'].apply(isCPRNUMvalid)
df['validREQ_SAMPLETIME'] = df['REQ_SAMPLETIME'].apply(isREQSAMLETIMEvalid)
df = df[df.validCRPNUM] #if CRPNUM is TRUE
df = df[df.validREQ_SAMPLETIME] #if REQ_SAMPLETIME is TRUE
print(df)
Upvotes: 1