Johanna

Reputation: 165

PySpark 'for' loop not correctly filtering a pyspark-sql dataframe using .filter()

I am trying to create a for loop in which I first filter a PySpark SQL dataframe, then convert the filtered dataframe to pandas, apply a function to it, and append the result to a list called results. My list contains a sequence of strings (which act as ids in the dataframe); in each iteration, the loop should take one string from the list and filter the dataframe down to the rows whose id matches that string. Sample code:

results = []
for x in list:
    aux = df.filter("id='x'")
    final = function(aux, "value")
    results.append(final)
results

The dataframe is a time-series. Outside the loop, the aux = df.filter("id='x'") transformation works and the function runs without problems; the issue only appears inside the loop, where aux.show() displays an empty dataframe.

Does anyone know why this may be happening?

Upvotes: 1

Views: 2042

Answers (1)

mck

Reputation: 42352

x is never substituted into the filter expression: the string "id='x'" matches rows whose id is the literal character x, not the value of the loop variable. Try the code below, which interpolates x into the expression.

results = []
for x in list:
    # substitute the current id into the SQL filter expression
    aux = df.filter("id = '%s'" % x)
    final = function(aux, "value")
    results.append(final)
results
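As a side note, string interpolation can be avoided entirely by filtering with a column expression. Here is a minimal, self-contained sketch; the toy dataframe, the ids list, and the sum over "value" (a stand-in for the question's function(aux, "value") helper) are made up for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()

    # Toy stand-in for the question's time-series dataframe
    df = spark.createDataFrame(
        [("a", 1.0), ("a", 2.0), ("b", 3.0)],
        ["id", "value"],
    )

    ids = ["a", "b"]  # avoids shadowing the built-in name `list`

    results = []
    for x in ids:
        # col("id") == x builds the predicate directly, so no quoting issues
        aux = df.filter(col("id") == x)
        # stand-in for function(aux, "value")
        results.append(aux.toPandas()["value"].sum())

    results

Because the comparison is done on a Column object rather than inside a string, there is nothing to interpolate, which also sidesteps quoting problems if an id ever contains a quote character.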

Upvotes: 1
