celllaa95

Reputation: 35

How to get DataFrame array values into an empty Python list

I am working with a Databricks DataFrame (PySpark).

I have a DataFrame that contains an array of string values.

I need to combine the DataFrame values with values from a Python list that I have.

What I want is to put the DataFrame values in a Python list, like this:

listArray = []

listArray.append(dataframeArrayValue)

print(listArray)
Output:
     [value1, value2, value3]

The problem is that it sort of works, but for some reason I cannot work with the string values that get added to the new list (listArray).

My idea is to build a URL. I need SQL to get the beginning of that URL, and that first part is what I put in the DataFrame array. The last part of the URL is stored in a Python list.

I want to loop through both lists and put the results in an empty list.

Something like this:

display(dfList)
Output:
      [dfValue1, dfValue2, dfValue3]

print(pyList)
Output:
      [pyValue1, pyValue2, pyValue3]

I want to put them together like this:

dfValue1 + pyValue1, etc.

And get a list like this:

newArrayContainingBoth = []

# loop with append
for dfValue, pyValue in zip(dfList, pyList):
    newArrayContainingBoth.append(dfValue + pyValue)

Result:

print(newArrayContainingBoth)

Output:
[dfValue1+pyValue1, dfValue2+pyValue2, dfValue3+pyValue3]

I hope my question was clear enough.

Upvotes: 1

Views: 185

Answers (1)

pvy4917

Reputation: 1822

Try these steps:

  • You can use explode() to get a string out of that array. Then,
  • collect() the result as a list,
  • extract the string part from each Row,
  • split() by a comma (",").
  • Finally, use it.

First, import explode():

from pyspark.sql.functions import explode 

Assuming your data is in the DataFrame "df":

columns = ['nameOffjdbc', 'some_column']
rows = [
        (['/file/path.something1'], 'value1'),
        (['/file/path.something2'], 'value2')
        ]

df = spark.createDataFrame(rows, columns)
df.show(2, False)
+-----------------------+-----------+
|nameOffjdbc            |some_column|
+-----------------------+-----------+
|[/file/path.something1]|value1     |
|[/file/path.something2]|value2     |
+-----------------------+-----------+

Select the column nameOffjdbc from DataFrame 'df'

dfArray = df.select('nameOffjdbc')
print(dfArray)
DataFrame[nameOffjdbc: array<string>]

Explode the column nameOffjdbc

dfArray = dfArray.withColumn('nameOffjdbc', explode('nameOffjdbc'))
dfArray.show(2, False)
+---------------------+
|nameOffjdbc          |
+---------------------+
|/file/path.something1| 
|/file/path.something2|
+---------------------+

Now collect it into newDfArray (this is the Python list you need).

newDfArray = dfArray.collect()
print(newDfArray)
[Row(nameOffjdbc=u'/file/path.something1'), 
     Row(nameOffjdbc=u'/file/path.something2')]

Since it is (or will be) in the format [Row(column=u'value')], we need to get the value (string) part of each Row. Hence:

pyList = ",".join(str('{0}'.format(value.nameOffjdbc)) for value in newDfArray)
print(pyList, type(pyList))
('/file/path.something1,/file/path.something2', <type 'str'>)
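
As a side note: the join-then-split round trip shown here works, but a plain list comprehension over the collected rows gives you the same list directly (a minimal sketch, using the same newDfArray as above; pyListAlt is just an illustrative name):

pyListAlt = [str(row.nameOffjdbc) for row in newDfArray]
print(pyListAlt)
['/file/path.something1', '/file/path.something2']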

Split the value by a comma ",", which will create a list out of a string.

pyList = pyList.split(',')
print(pyList, type(pyList))
(['/file/path.something1', '/file/path.something2'], <type 'list'>)

Use it

print(pyList[0])
/file/path.something1

print(pyList[1])
/file/path.something2

If you want to loop

for items in pyList:
    print(items)
/file/path.something1
/file/path.something2

In a nutshell, the following code is all you need:

from pyspark.sql.functions import explode

columns = ['nameOffjdbc', 'some_column']
rows = [
    (['/file/path.something1'], 'value1'),
    (['/file/path.something2'], 'value2')
]
df = spark.createDataFrame(rows, columns)

dfArray = df.select('nameOffjdbc')

dfArray = dfArray.withColumn('nameOffjdbc', explode('nameOffjdbc')).collect()
pyList = ",".join(str('{0}'.format(value.nameOffjdbc)) for value in dfArray).split(',')

NOTE: collect() always collects a DataFrame's rows into a Python list (of Row objects).
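
Finally, to pair this list with the Python list from your question and build the URLs, a zip-based loop like the one you sketched should do it (pySuffixes is a hypothetical stand-in for your own list):

pySuffixes = ['?id=1', '?id=2']  # hypothetical: your own Python list
newArrayContainingBoth = []
for prefix, suffix in zip(pyList, pySuffixes):
    newArrayContainingBoth.append(prefix + suffix)

print(newArrayContainingBoth)
['/file/path.something1?id=1', '/file/path.something2?id=2']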


Upvotes: 1
