Reputation: 41
In PySpark, I am trying to load a DataFrame from a string variable.
The variable holds multi-line text:
string_data = """
Name|age|city
david|23|London
krish|24|Bali
john|56|Goa
"""
I want to load this data into a DataFrame in PySpark. I thought of using Datasets, but they are not available in PySpark.
With pandas, I would write it like this:
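from io import StringIO
import pandas as pd

# Wrap the string in a file-like object so read_csv can consume it
string2 = StringIO(string_data)
df = pd.read_csv(string2, sep='|')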
Upvotes: 0
Views: 1278
Reputation: 42352
You can split the string on newline characters, parallelize the resulting lines into an RDD, and pass that RDD to spark.read.csv. Here sc is the SparkContext (spark.sparkContext).
df = spark.read.csv(sc.parallelize(string_data.split('\n')), sep='|', header=True)
df.show()
+-----+---+------+
| Name|age|  city|
+-----+---+------+
|david| 23|London|
|krish| 24|  Bali|
| john| 56|   Goa|
+-----+---+------+
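If the string is small, another option is to parse it on the driver and hand the rows to spark.createDataFrame. The following is only a rough sketch: the helper names parse_delimited and to_spark_df are made up for illustration (they are not part of PySpark), and spark is assumed to be an existing SparkSession. All columns come out as strings, as they do with spark.read.csv when inferSchema is not set.

```python
import csv
import io


def parse_delimited(text, sep="|"):
    """Split delimited text into (header, rows), skipping blank lines.

    Hypothetical helper for illustration; not a PySpark function.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    reader = csv.reader(io.StringIO("\n".join(lines)), delimiter=sep)
    header = next(reader)                  # first non-blank line holds the column names
    rows = [tuple(row) for row in reader]  # remaining lines are data rows
    return header, rows


def to_spark_df(spark, text, sep="|"):
    """Build a DataFrame of string columns from delimited text.

    Assumes `spark` is an existing SparkSession (hypothetical helper).
    """
    header, rows = parse_delimited(text, sep)
    return spark.createDataFrame(rows, schema=header)


string_data = """
Name|age|city
david|23|London
krish|24|Bali
john|56|Goa
"""

header, rows = parse_delimited(string_data)
print(header)
print(rows)
```

With a SparkSession available, to_spark_df(spark, string_data).show() should display the same three rows. This keeps the parsing on the driver, so it only makes sense for data that comfortably fits in memory there.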
Upvotes: 1