Reputation: 3153
I am working in AWS Glue, so I cannot use Pandas/NumPy etc. at the moment.
I have a DataFrame of records that I need to process and write to a MySQL database. For each record I need to check whether it already exists and, if it does, run an INSERT ... ON DUPLICATE KEY UPDATE. For this reason I need to loop through the DataFrame using native Python. All the DataFrame iterators I found use pandas, but is there a way to do this without pandas?
Please find herewith a sample dataframe:
df1 = sqlContext.createDataFrame([
    ('4001', '81A01', 'Portland, ME', 'NY'),
    ('4002', '44444', 'Portland, ME', 'NY'),
    ('4022', '33333', 'BANGALORE', 'KA'),
    ('5222', '88888', 'CHENNAI', 'TN')],
    ("zip_code_new", "territory_code_new", "territory_name_new", "state_new"))
I tried the following, but I got the error "AttributeError: 'DataFrame' object has no attribute 'values'":
for i in df1.values():
    print i
UPDATE: The following code seems to work with native Python to loop through the DataFrame. psidom's answer should also work, but I could not see the print results (presumably because foreach runs on the executors, so the output goes to the executor logs rather than the driver console).
arr = df1.collect()
for r in arr:
    print r.zip_code_new
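If the goal is the MySQL upsert itself, the statement can run inside that loop. Below is a minimal sketch, assuming the pure-Python PyMySQL driver is packaged with the Glue job and using a hypothetical table territories with made-up connection details; note that collect() pulls every row to the driver, which is fine for small DataFrames:
import pymysql  # assumption: pure-Python driver bundled with the Glue job

# Hypothetical connection details and table/column names -- adjust to your schema.
conn = pymysql.connect(host='myhost', user='myuser',
                       password='mypassword', database='mydb')
upsert_sql = """
    INSERT INTO territories (zip_code, territory_code, territory_name, state)
    VALUES (%s, %s, %s, %s)
    ON DUPLICATE KEY UPDATE
        territory_code = VALUES(territory_code),
        territory_name = VALUES(territory_name),
        state = VALUES(state)
"""
with conn.cursor() as cur:
    for r in df1.collect():
        cur.execute(upsert_sql, (r.zip_code_new, r.territory_code_new,
                                 r.territory_name_new, r.state_new))
conn.commit()
conn.close()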
Thanks
Upvotes: 0
Views: 688
Reputation: 214957
You don't use a for loop on a Spark DataFrame; it has a foreach
method to loop through the rows. For example, we can print the zip_code_new
in each row as follows:
def process_row(r):
    # your sql statement may go here
    print('zip_code_new: ', r.zip_code_new)

df1.foreach(process_row)
#('zip_code_new: ', u'4002')
#('zip_code_new: ', u'5222')
#('zip_code_new: ', u'4022')
#('zip_code_new: ', u'4001')
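If the database writes should happen on the executors rather than the driver, foreachPartition is the usual fit: it opens one database connection per partition instead of one per row. A sketch, treating PyMySQL, the table territories, and the credentials as assumptions:
def process_partition(rows):
    # This runs on the executors, so the connection must be created here,
    # not on the driver; pymysql is imported inside for the same reason.
    import pymysql
    conn = pymysql.connect(host='myhost', user='myuser',
                           password='mypassword', database='mydb')
    upsert_sql = """
        INSERT INTO territories (zip_code, territory_code, territory_name, state)
        VALUES (%s, %s, %s, %s)
        ON DUPLICATE KEY UPDATE
            territory_code = VALUES(territory_code),
            territory_name = VALUES(territory_name),
            state = VALUES(state)
    """
    with conn.cursor() as cur:
        for r in rows:
            cur.execute(upsert_sql, (r.zip_code_new, r.territory_code_new,
                                     r.territory_name_new, r.state_new))
    conn.commit()
    conn.close()

df1.foreachPartition(process_partition)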
Upvotes: 1