Reputation: 437
I have a dataframe:
import pandas as pd
df = pd.DataFrame({'start' : [5, 10, '$%%', 20], 'stop' : [10, 20, 30, 40]})
df['length_of_region'] = pd.Series([0 for i in range(0, len(df['start']))])
I want to calculate the length of each region only for rows whose values are non-zero numbers, and skip any row with an invalid value while recording an error note for it. Here is what I have so far:
df['Notes'] = pd.Series(["" for i in range(0, len(df['region_name']))])
for i in range(0, len(df['start'])):
    if pd.isnull(df['start'][i]) == True:
        df['Notes'][i] += 'Error: Missing value for chromosome start at region %s, required value;' % (df['region_name'][i])
        df['critical_error'][i] = True
        num_error = num_error+1
    else:
        try:
            #print (df['start'][i]).isnumeric()
            start = int(df['start'][i])
            #print start
            #print df['start'][i]
            if start == 0:
                raise ValueError
        except:
            df['Notes'][i] += 'Error: Chromosome start should be a non zero number at region %s; ' % (df['region_name'][i])
            #print df['start'][i]
            df['critical_error'][i] = True
            num_error = num_error+1
for i in range(0, len(df['start'])):
    if df['critical_error'][i] == True:
        continue
    df['length_of_region'][i] = (df['stop'][i] - df['start'][i]) + 1.0
However, pandas converts df['start'] into a str variable, and even if I use int to convert it, I get the following error:
df['length_of_region'][i] = (df['stop'][i] - df['start'][i]) + 1.0
TypeError: unsupported operand type(s) for -: 'numpy.int64' and 'str'
What am I missing here? Thanks for your time!
Upvotes: 0
Views: 1692
Reputation: 437
After staring at the code for quite some time, I found a simple fix: inside the try-except, assign the converted start value back to df['start'][i], as follows:
for i in range(0, len(df['start'])):
    if pd.isnull(df['start'][i]) == True:
        df['Notes'][i] += 'Error: Missing value for chromosome start at region %s, required value;' % (df['region_name'][i])
        df['critical_error'][i] = True
        num_error = num_error+1
    else:
        try:
            start = int(df['start'][i])
            # write the converted value back so the column holds an int
            df['start'][i] = start
            if start == 0:
                raise ValueError
        except:
            df['Notes'][i] += 'Error: Chromosome start should be a non zero number at region %s; ' % (df['region_name'][i])
            #print df['start'][i]
            df['critical_error'][i] = True
            num_error = num_error+1
for i in range(0, len(df['start'])):
    if df['critical_error'][i] == True:
        continue
    df['length_of_region'][i] = (df['stop'][i] - df['start'][i]) + 1.0
Writing the converted value back to df['start'][i] stores it as an int, so length_of_region can be calculated for the rows that hold numeric values.
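To illustrate why the write-back matters, here is a minimal sketch. It assumes a column that mixes numbers and strings (the '20' stored as text is my own assumption, to mimic a value read in as a string) and uses df.loc rather than chained indexing such as df['start'][i]:

```python
import pandas as pd

# Assumption: the last start value arrives as the string '20'.
df = pd.DataFrame({'start': [5, 10, '$%%', '20'], 'stop': [10, 20, 30, 40]})

# A column mixing numbers and strings has dtype object,
# so each cell keeps its own Python type.
print(df['start'].dtype)

# int() returns a new value; the cell in the DataFrame is left unchanged.
converted = int(df.loc[3, 'start'])
still_str = isinstance(df.loc[3, 'start'], str)

# Only writing the converted value back changes what is stored.
df.loc[3, 'start'] = converted
length = (df.loc[3, 'stop'] - df.loc[3, 'start']) + 1.0
print(still_str, length)
```

Without the write-back, the subtraction in the second loop still sees the original string, which is what produces the TypeError.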
Upvotes: 0
Reputation: 2980
You can define a custom function to do the calculation then apply that function to each row.
def calculate_region_length(x):
    # x is one row of df[['start', 'stop']]
    start_val = x['start']
    stop_val = x['stop']
    try:
        start_val = float(start_val)
        return (stop_val - start_val) + 1.0
    except ValueError:
        # start could not be converted to a number
        return None
The custom function accepts one row (the start and stop values) as input. It tests whether the start value can be converted to a float; if it cannot, None is returned. This way, if '1' is stored as a string, the value can still be converted to a float and won't be skipped, whereas '$%%' in your example cannot be converted and will return None.
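As a quick check of that conversion behaviour, here is a small sketch; to_float_or_none is a hypothetical helper used only for this illustration:

```python
def to_float_or_none(value):
    """Return value as a float, or None if it is not a valid number string."""
    try:
        return float(value)
    except ValueError:
        return None

print(to_float_or_none('1'))    # numeric string: converted
print(to_float_or_none('$%%'))  # not a number: None
```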
Next you call the custom function for each row:
df['length_of_region'] = df[['start', 'stop']].apply(calculate_region_length, axis=1)
This will create your new column with (stop - start) + 1.0 for rows where start can be converted to a number, and a missing value (None) for rows where start is a string that cannot be converted.
You can then update the Notes field for the rows where None was returned, to identify the regions whose start value is missing or invalid:
df.loc[df['length_of_region'].isnull(), 'Notes'] = df['region_name']
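As an alternative that avoids a row-by-row function, the same idea can be sketched with pd.to_numeric and errors='coerce', which turns unparseable values into NaN. This is a sketch under assumptions, not the asker's code: the region_name values, the zero start in the last row, and the note wording are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'region_name': ['r1', 'r2', 'r3', 'r4'],  # hypothetical region names
    'start': [5, 10, '$%%', 0],               # last row: zero start, also invalid
    'stop': [10, 20, 30, 40],
})

# Non-numeric start values become NaN instead of raising an error.
start = pd.to_numeric(df['start'], errors='coerce')

# A row is valid when start is numeric and non-zero.
valid = start.notna() & (start != 0)

# Compute the length for every row, then blank out the invalid ones.
df['length_of_region'] = (df['stop'] - start + 1.0).where(valid)

# Record an error note only for the invalid rows.
df['Notes'] = ''
df.loc[~valid, 'Notes'] = 'Error: invalid start at region ' + df['region_name']

print(df)
```

Rows with an invalid start end up with NaN in length_of_region and a note naming the region, while valid rows keep an empty note.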
Upvotes: 1