Dane Dane
Dane Dane

Reputation: 11

Cleaning data, deleting rows with any text in certain columns

i'm trying to do some date cleaning on some huge datasets, and I'm new to python (I've used google to search for my problem), so please bear over me with my terminology.

The data is imported from a CSV into a pandas.core.frame.DataFrame Some of my columns should only contain numbers and others only text:

        CPRNUM           REQ_SAMPLETIME     SAMPLE_ID   RESULT      

0       1234567890       2014-05-30 07:59   50226686    0.7409090909090907  
1       The sample was.. 2013-09-04 07:45   47721186    0.8290909090909093  
2       1234567890       The sample was..   46473016    1.0918181818181818  

I would really like to get rid of the rows within column REQ_SAMPLETIME and CPRNUM which is not 10 digits long, and contain text, so it would look like:

        CPRNUM           REQ_SAMPLETIME     SAMPLE_ID   RESULT      

0       1234567890       2014-05-30 07:59   50226686    0.7409090909090907  
3       0987654321       2018-06-10 05:32   12354678    3.7290909090909093  
4       1234567890       2013-09-04 07:45   15672687    5.9999951818181818  

Thanks for the help

Thanks to Danish Bansal, I used your code, as it fits my problem the most:

My final code looks like this:

hba1c = pd.read_csv(r"C:\sample.csv", encoding = 'unicode_escape', engine ='python', sep = ';')

#This function checks if the number is valid
def isCPRNUMvalid(val):
    if len(val) == 10: #if string has length of 10
        if val.isnumeric(): #if string is pure number 
            return True
    return False

hba1c['validCRPNUM'] = hba1c['CPRNUM'].apply(isCPRNUMvalid)
hba1c = hba1c[hba1c.validCRPNUM]
hba1c['Dates'] = pd.to_datetime(hba1c['REQ_SAMPLETIME']).dt.date

C:\Users\tphni\.conda\envs\py37\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.

hba1c.head()

I get the error message as typed above in the code, but it still seems to work, and the output is as excepted, with the correct number of fewer rows than the original file.

Upvotes: 1

Views: 325

Answers (2)

niko
niko

Reputation: 5281

I would suggest casting your columns to the desired format (e.g. numeric;datetime) and removing rows with problems:

v0 = df.CPRNUM.astype(str).str.len() == 10
v1 = pd.to_numeric(df.CPRNUM, errors="coerce").notna()
v2 = pd.to_datetime(df.REQ_SAMPLETIME, errors="coerce").notna()
filtered = df[v0 * v1 * v2]
print(filtered)

Explanation:

  • pd.to_<type> casts the input to the given type
  • errors="coerce" turns the problematic elements to NAs
  • .notna() gives a boolean vector with true where the value is not NA
  • * multiplying the boolean vectors gives us the vector corresponding to the and operator

Data

import pandas as pd

# Sample data
records = [
    dict(CPRNUM="1234567890", REQ_SAMPLETIME="2014-05-30 07:59", SAMPLE_ID=50226686, RESULT=0.7409090909090907),
    dict(CPRNUM="The sample was..", REQ_SAMPLETIME="2014-05-30 07:59", SAMPLE_ID=50226686, RESULT=0.7409090909090907),
    dict(CPRNUM="0987654321", REQ_SAMPLETIME="The sample was..", SAMPLE_ID=50226686, RESULT=0.7409090909090907),
    dict(CPRNUM="123", REQ_SAMPLETIME="The sample was..", SAMPLE_ID=50226686, RESULT=0.7409090909090907)
]
df = pd.DataFrame.from_records(records)
print(df)
#              CPRNUM    REQ_SAMPLETIME  SAMPLE_ID    RESULT
# 0        1234567890  2014-05-30 07:59   50226686  0.740909
# 1  The sample was..  2014-05-30 07:59   50226686  0.740909
# 2        0987654321  The sample was..   50226686  0.740909
# 3               123  The sample was..   50226686  0.740909

Upvotes: 1

Danish Bansal
Danish Bansal

Reputation: 700

Another way to solve this is as below

import pandas as pd
import re

df = pd.read_csv('data.csv')

#This function checks if the number is valid
def isCPRNUMvalid(val):
    if len(val) == 10: #if string has length of 10
        if val.isnumeric(): #if string is pure number 
            return True
    return False

#This functoin checks if datetime is valid
#You can modify this to get date-time from string if present
def isREQSAMLETIMEvalid(st):
    r = r'[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}' #regex to match
    if re.match(r, st):
        return True
    return False


df['validCRPNUM'] = df['CPRNUM'].apply(isCPRNUMvalid)
df['validREQ_SAMPLETIME'] = df['REQ_SAMPLETIME'].apply(isREQSAMLETIMEvalid)

df = df[df.validCRPNUM] #if CRPNUM is TRUE
df = df[df.validREQ_SAMPLETIME] #if REQ_SAMPLETIME is TRUE
print(df)

Upvotes: 1

Related Questions