Skip operations on row if it is non numeric in pandas dataframe

Question

I have a dataframe:

import pandas as pd
df = pd.DataFrame({'start' : [5, 10, '$%%', 20], 'stop' : [10, 20, 30, 40]})
df['length_of_region'] = pd.Series([0 for i in range(0, len(df['start']))])

I want to calculate the length of each region only for rows whose start value is a non-zero number, and skip the calculation (recording an error note) for rows where the value is invalid. Here is what I have so far:

df['Notes'] = pd.Series(["" for i in range(0, len(df['region_name']))])

for i in range(0, len(df['start'])):
    if pd.isnull(df['start'][i]) == True:
        df['Notes'][i] += 'Error: Missing value for chromosome start at region %s, required value;' % (df['region_name'][i])
        df['critical_error'][i] = True
        num_error = num_error+1
    else:
        try:
            #print (df['start'][i]).isnumeric()
            start = int(df['start'][i])
            #print start
            #print df['start'][i]
            if start == 0:
                raise ValueError
        except:
            df['Notes'][i] += 'Error: Chromosome start should be a non zero number at region %s; ' % (df['region_name'][i])
            #print df['start'][i]
            df['critical_error'][i] = True
            num_error = num_error+1
for i in range(0, len(df['start'])):
    if df['critical_error'][i] == True:
        continue
    df['length_of_region'][i] = (df['stop'][i] - df['start'][i]) + 1.0

However, pandas stores the values in df['start'] as str, and even though I convert them with int(), I get the following error:

df['length_of_region'][i] = (df['stop'][i] - df['start'][i]) + 1.0

TypeError: unsupported operand type(s) for -: 'numpy.int64' and 'str'

What am I missing here? Thanks for your time!

Manasi Shah · Accepted Answer

After staring at the code for quite some time, I found a simple fix: write the converted value start back into the dataframe inside the try block, as follows:

for i in range(0, len(df['start'])):
    if pd.isnull(df['start'][i]):
        df.loc[i, 'Notes'] += 'Error: Missing value for chromosome start at region %s, required value;' % (df['region_name'][i])
        df.loc[i, 'critical_error'] = True
        num_error = num_error + 1
    else:
        try:
            start = int(df['start'][i])
            # Write the converted int back so later arithmetic sees a number, not a str
            df.loc[i, 'start'] = start
            if start == 0:
                raise ValueError
        except (ValueError, TypeError):
            df.loc[i, 'Notes'] += 'Error: Chromosome start should be a non zero number at region %s; ' % (df['region_name'][i])
            df.loc[i, 'critical_error'] = True
            num_error = num_error + 1
# Second pass: compute the length only for rows without a critical error
for i in range(0, len(df['start'])):
    if df['critical_error'][i]:
        continue
    df.loc[i, 'length_of_region'] = (df['stop'][i] - df['start'][i]) + 1.0

In the original code, int() only converted a local copy, so the dataframe still held a str. Writing the converted value back stores it as an int, so length_of_region is calculated only for rows with a valid numeric start.
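
For comparison, the same row-skipping idea can be sketched without an explicit loop, using pd.to_numeric with errors='coerce' to turn unparseable values into NaN. This is only a minimal sketch on the sample data from the question, not the method used above. It assumes a single shared error message (missing and non-numeric values are not distinguished) and leaves out the region_name column, which is not part of the sample dataframe.

```python
import pandas as pd

# Sample data from the question; 'start' mixes numbers with a non-numeric string.
df = pd.DataFrame({'start': [5, 10, '$%%', 20], 'stop': [10, 20, 30, 40]})

# Convert to numbers; anything that cannot be parsed becomes NaN instead of raising.
start = pd.to_numeric(df['start'], errors='coerce')

# A row is usable only if its start value parsed and is non-zero.
valid = start.notna() & (start != 0)

# Flag and annotate the rows that fail the check (simplified message, an assumption).
df['critical_error'] = ~valid
df['Notes'] = ''
df.loc[~valid, 'Notes'] = 'Error: Chromosome start should be a non zero number; '

# Compute the length for valid rows only; invalid rows are left as NaN.
df['length_of_region'] = (df['stop'] - start + 1.0).where(valid)

num_error = int((~valid).sum())
print(df)
```

Invalid rows end up with NaN in length_of_region rather than 0, which keeps them easy to tell apart from real results.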
