Reputation: 87
I'm attempting to write the results of a regression back to MySQL, but am having problems iterating through the fitted values and getting the NaNs to write as null values. Originally, I did the iteration this way:
for i in dataframe:
    cur = cnx.cursor()
    query = ("UPDATE Regression_Data.Input SET FITTEDVALUES="+(dataframe['yhat'].__str__())+" where timecount="+(dataframe['timecount'].__str__())+";")
    cur.execute(query)
    cnx.commit()
    cur.close()
…which MySQL threw back at me, saying:
"mysql.connector.errors.ProgrammingError: 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'NaN'
So, I've been trying to filter out the NaNs by only asking Python to commit when yhat does not equal NaN:
for i in dataframe:
    if cleandf['yhat']>(-1000):
        cur = cnx.cursor()
        query = ("UPDATE Regression_Data.Input SET FITTEDVALUES="+(dataframe['yhat'].__str__())+" where timecount="+(dataframe['timecount'].__str__())+";")
        cur.execute(query)
        cnx.commit()
        cur.close()
But then I get this:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
So, I tried to get around it by changing the condition in the syntax above to:
if cleandf['yhat'][i]>(-1000):
but then get this:
ValueError: Can only tuple-index with a MultiIndex
And then I tried adding iterrows() to both, as in:
for i in dataframe.iterrows():
    if cleandf['yhat'][i]>(-1000):
but I get the same problems as above.
I'm not sure what I'm doing wrong here, but I assume it's something to do with iterating over pandas DataFrames. And even if I get the iteration right, I still want to write NULLs into SQL wherever a NaN appears.
So, how do you think I should do this?
Upvotes: 1
Views: 2674
Reputation: 3509
I don't have a complete answer, but perhaps I have some tips that might help. I believe you are thinking of your dataframe as an object similar to a SQL record set.
for i in dataframe
This will iterate over the column name strings in the dataframe: i takes on column names, not rows.
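For example, with a small hypothetical dataframe standing in for your regression output:

import pandas as pd

# hypothetical stand-in for the real regression output
dataframe = pd.DataFrame({'timecount': [1, 2, 3],
                          'yhat': [0.5, float('nan'), 2.7]})

for i in dataframe:
    print(i)   # prints 'timecount', then 'yhat' -- column names, not rows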
dataframe['yhat']
This returns an entire column (a pandas.Series, which wraps a numpy.ndarray), not a single value. Therefore:
dataframe['yhat'].__str__()
will give a string representation of the entire column, formatted for humans to read. It is certainly not a single value that can be converted to a string for your query.
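For instance, with the hypothetical dataframe from above:

print(dataframe['yhat'].__str__())
# 0    0.5
# 1    NaN
# 2    2.7
# Name: yhat, dtype: float64

That multi-line text is what was being spliced into your SQL, NaN included.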
if cleandf['yhat']>(-1000)
This gives an error because, again, cleandf['yhat'] is an entire array of values, not just a single value. Think of it as an entire column, not the value from a single row.
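A quick illustration with the same hypothetical dataframe:

mask = dataframe['yhat'] > -1000   # a boolean Series, one value per row
if mask.any():                     # reduce to a single True/False first
    print('at least one row passes')
# a bare 'if mask:' raises the ambiguous-truth-value ValueError you saw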
if cleandf['yhat'][i]>(-1000):
This is getting closer, but you really want i to be an integer here, not another column name.
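With a plain integer against the default index it does return one value (same hypothetical frame):

dataframe['yhat'][0]   # a single scalar: 0.5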
for i in dataframe.iterrows():
    if cleandf['yhat'][i]>(-1000):
Using iterrows seems like the right thing for you. However, i takes on an (index, row) tuple for each row, not an integer that can index into a column (and cleandf['yhat'] is still a full column), which is why you got the "Can only tuple-index with a MultiIndex" error.
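You can see what iterrows yields with the hypothetical dataframe:

for i in dataframe.iterrows():
    print(type(i))            # <class 'tuple'> -- an (index, row-Series) pair

for idx, row in dataframe.iterrows():
    print(idx, row['yhat'])   # unpacked: the index label and one scalar per row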
Also, note that pandas has better ways to check for missing values than relying on a huge negative number. Try something like this:
non_missing_index = pandas.notnull(dataframe['yhat'])   # True where yhat is present
cleandf = dataframe[non_missing_index]
for row in cleandf.iterrows():
    row_index, row_values = row
    query = ("UPDATE Regression_Data.Input SET FITTEDVALUES="+str(row_values['yhat'])+" where timecount="+str(row_values['timecount'])+";")
    execute_my_query(query)
You can implement execute_my_query better than I can, I expect. However, this solution is not quite what you want. You really want to iterate over all rows and do two types of updates. Try this:
for row in dataframe.iterrows():
    row_index, row_values = row
    if pandas.isnull(row_values['yhat']):
        # write NULL where yhat is NaN
        query = ("UPDATE Regression_Data.Input SET FITTEDVALUES=NULL where timecount="+str(row_values['timecount'])+";")
    else:
        query = ("UPDATE Regression_Data.Input SET FITTEDVALUES="+str(row_values['yhat'])+" where timecount="+str(row_values['timecount'])+";")
    execute_my_query(query)
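As for execute_my_query, here is a minimal sketch of what it could look like (a hypothetical helper; it assumes the mysql.connector connection cnx from your question). Passing values as parameters is also safer than string concatenation, and Connector/Python converts a Python None into SQL NULL for you, which sidesteps the NaN problem entirely:

import pandas

def execute_my_query(query, params=None):
    # hypothetical helper: run one statement on the existing connection
    cur = cnx.cursor()
    cur.execute(query, params)
    cnx.commit()
    cur.close()

for row_index, row_values in dataframe.iterrows():
    yhat = row_values['yhat']
    fitted = None if pandas.isnull(yhat) else float(yhat)   # None becomes SQL NULL
    execute_my_query(
        "UPDATE Regression_Data.Input SET FITTEDVALUES=%s where timecount=%s",
        (fitted, int(row_values['timecount'])))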
Hope it helps.
Upvotes: 3