Saikishore Namburi

Reputation: 41

Load a variable into a dataframe

In PySpark, I am trying to load a dataframe from a string variable.

My variable is a multi-line string:

string_data = """
Name|age|city
david|23|London
krish|24|Bali
john|56|Goa
"""

I want to load this data into a PySpark DataFrame. I thought of using Datasets, but they are not available in PySpark.

With pandas, I would write it like this:

from io import StringIO
import pandas as pd

string2 = StringIO(string_data)
df = pd.read_csv(string2, sep='|')

Upvotes: 0

Views: 1278

Answers (1)

mck

Reputation: 42352

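You can split the string on newline characters, parallelize the resulting list into an RDD, and pass that RDD to spark.read.csv. Here sc is the SparkContext (predefined in the PySpark shell, otherwise available as spark.sparkContext).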

df = spark.read.csv(sc.parallelize(string_data.split('\n')), sep='|', header=True)

df.show() 
+-----+---+------+
| Name|age|  city|
+-----+---+------+
|david| 23|London|
|krish| 24|  Bali|
| john| 56|   Goa|
+-----+---+------+
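If you would rather not go through an RDD, another option is to parse the string on the driver and hand the rows to spark.createDataFrame. This is only a minimal sketch, not part of the approach above: parse_delimited and to_spark_df are hypothetical helper names, and it assumes the data is small enough to parse in plain Python and that you already have a SparkSession.

```python
import csv
import io


def parse_delimited(text, sep="|"):
    """Split delimited text into (column_names, rows), skipping blank lines."""
    # Strip each line so the blank first/last lines of a triple-quoted
    # string and any stray indentation do not end up in the data.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    reader = csv.reader(io.StringIO("\n".join(lines)), delimiter=sep)
    header, *rows = [tuple(record) for record in reader]
    return list(header), rows


def to_spark_df(spark, text, sep="|"):
    """Build a DataFrame from delimited text; `spark` is an existing SparkSession."""
    columns, rows = parse_delimited(text, sep)
    # Every column is a string here; cast afterwards if numeric types are needed.
    return spark.createDataFrame(rows, schema=columns)


string_data = """
Name|age|city
david|23|London
krish|24|Bali
john|56|Goa
"""

columns, rows = parse_delimited(string_data)
print(columns)
print(rows)
```

With a session available, to_spark_df(spark, string_data).show() should display the same three rows. As with spark.read.csv without inferSchema=True, all columns come back as strings.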

Upvotes: 1
