Reputation: 2811
Most questions about Spark use the output of show
as the code example, without the code that generates the DataFrame, like this:
df.show()
+-------+--------+----------+
|USER_ID|location| timestamp|
+-------+--------+----------+
|      1|    1001|1265397099|
|      1|    6022|1275846679|
|      1|    1041|1265368299|
+-------+--------+----------+
How can I reproduce this data in my programming environment without rewriting it manually? Does PySpark have an equivalent of read_clipboard
in pandas?
The lack of a function to import data into my environment is a big obstacle for me when helping others with PySpark on Stack Overflow.
So my question is:
What is the most convenient way to load data pasted on Stack Overflow from the output of the show
command into my environment?
Upvotes: 3
Views: 1156
Reputation: 4540
Late answer, but I often face the same issue, so I wrote a small utility for this: https://github.com/ollik1/spark-clipboard
It basically allows copy-pasting DataFrame show strings into Spark. To install it, add the jcenter dependency com.github.ollik1:spark-clipboard_2.12:0.1
and the Spark config .config("fs.clipboard.impl", "com.github.ollik1.clipboard.ClipboardFileSystem")
After this, data frames can be read directly from the system clipboard
val df = spark.read
.format("com.github.ollik1.clipboard")
.load("clipboard:///*")
or alternatively from files, if you prefer. Installation details and usage are described in the README file.
Upvotes: 2
Reputation: 40370
You can use the following function:
from pyspark.sql.functions import col, when

def read_spark_output(file_path):
    # `spark` is assumed to be an existing SparkSession.
    # Read the saved show() output as pipe-delimited CSV; lines starting with "+" (the borders) are skipped as comments
    step1 = spark.read \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .option("delimiter", "|") \
        .option("parserLib", "UNIVOCITY") \
        .option("ignoreLeadingWhiteSpace", "true") \
        .option("ignoreTrailingWhiteSpace", "true") \
        .option("comment", "+") \
        .csv("file://{}".format(file_path))
    # Drop the unnamed columns (_c0, ...) created by the leading and trailing pipes
    step2 = step1.select([c for c in step1.columns if not c.startswith("_")])
    # Replace the literal string 'null' with a real null
    return step2.select(*[when(~col(col_name).eqNullSafe("null"), col(col_name)).alias(col_name) for col_name in step2.columns])
It's one of the suggestions given in the following question: How to make good reproducible Apache Spark examples.
Note 1: There are special cases where this does not work and results in errors, for example the data in the question: Group by column "grp" and compress DataFrame - (take last not null value for each column ordering by column "ord"). So please use it with caution!
Note 2: (Disclaimer) I'm not the original author of the code. Thanks to @MaxU for the code. I just made some modifications to it.
Upvotes: 2
Reputation: 2695
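You can combine pandas read_clipboard with a conversion to a PySpark DataFrame:
import pandas as pd
from pyspark.sql.types import StructType, StructField, StringType, LongType

# sep=',' assumes the copied data is comma-separated.
# index_col=0 makes the first copied column the pandas index; omit it if the clipboard data has no index column.
pdDF = pd.read_clipboard(sep=',',
                         index_col=0,
                         names=['USER_ID',
                                'location',
                                'timestamp',
                                ])
# The third argument of StructField (True) marks the field as nullable
mySchema = StructType([StructField("USER_ID", StringType(), True),
                       StructField("location", LongType(), True),
                       StructField("timestamp", LongType(), True)])
# `spark` is assumed to be an existing SparkSession
df = spark.createDataFrame(pdDF, schema=mySchema)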
Update:
What @terry really wants is to copy an ASCII table into Python; the following is an example. Once the data is parsed into Python, you can convert it to anything.
def parse(ascii_table):
    header = []
    data = []
    for line in ascii_table.split('\n'):
        line = line.strip()
        # Skip blank lines and border lines such as +-----+-----+
        if not line or line.startswith('+'):
            continue
        # Drop the outer pipes, split on the inner ones, and trim the padding
        cells = [cell.strip() for cell in line[1:-1].split('|')]
        if not header:
            header = cells
            continue
        data.append(cells)
    return header, data
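As a follow-up to "you can convert to anything", here is a minimal, self-contained sketch of one way to go from pasted show output to typed rows. The names show_to_rows and coerce are hypothetical helpers made up for this illustration, and the type guessing (the literal text null becomes None, numeric-looking text becomes int or float) is an assumption; the original types cannot be recovered reliably from the printed table.

```python
def coerce(cell):
    """Best-effort guess at a cell's original type (an assumption, not a guarantee)."""
    if cell == 'null':
        return None
    try:
        return int(cell)
    except ValueError:
        pass
    try:
        return float(cell)
    except ValueError:
        return cell


def show_to_rows(show_text):
    """Parse text printed by df.show() into (header, rows)."""
    header, rows = None, []
    for line in show_text.splitlines():
        line = line.strip()
        # Skip blank lines and border lines such as +-----+-----+
        if not line or line.startswith('+'):
            continue
        # Drop the outer pipes, split on the inner ones, and trim the padding
        cells = [cell.strip() for cell in line[1:-1].split('|')]
        if header is None:
            header = cells
        else:
            rows.append(tuple(coerce(cell) for cell in cells))
    return header, rows


sample = """
+-------+--------+----------+
|USER_ID|location| timestamp|
+-------+--------+----------+
|      1|    1001|1265397099|
|      1|    6022|1275846679|
|      1|    1041|1265368299|
+-------+--------+----------+
"""

header, rows = show_to_rows(sample)
print(header)
print(rows)
```

With a live SparkSession named spark (not created here), spark.createDataFrame(rows, header) should then build a DataFrame from the result, with column types inferred from the Python values.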
Upvotes: 0
Reputation: 936
You can always read the data into a pandas DataFrame and then convert it to a Spark DataFrame. No, unlike pandas, PySpark has no direct equivalent of read_clipboard.
The reason is that pandas DataFrames are mostly flat structures, whereas Spark DataFrames can have complex structures such as structs and arrays. Spark has a wide variety of data types, and those don't appear in the console output, so it is not possible to recreate the DataFrame exactly from that output.
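To illustrate the point about lost type information, here is a small pure-Python sketch. render_like_show is a hypothetical toy function written for this illustration; it is only a simplified stand-in for show (values printed as text and right-aligned), not Spark's actual implementation. It shows that integer values and string values can produce exactly the same printed table, so the text alone cannot tell you which one the original DataFrame held.

```python
def render_like_show(header, rows):
    """Toy stand-in for df.show(): print every value via str(), right-aligned."""
    table = [list(header)] + [[str(value) for value in row] for row in rows]
    # Each column is as wide as its longest cell, header included
    widths = [max(len(r[i]) for r in table) for i in range(len(header))]
    border = '+' + '+'.join('-' * w for w in widths) + '+'
    lines = [border]
    for n, r in enumerate(table):
        lines.append('|' + '|'.join(c.rjust(w) for c, w in zip(r, widths)) + '|')
        if n == 0:
            # Separator between the header and the data rows
            lines.append(border)
    lines.append(border)
    return '\n'.join(lines)


header = ['USER_ID', 'location']
as_ints = [(1, 1001), (1, 6022)]
as_strings = [('1', '1001'), ('1', '6022')]

ints_text = render_like_show(header, as_ints)
strings_text = render_like_show(header, as_strings)

print(ints_text)
# Same text for both inputs, even though the column types differ
print(ints_text == strings_text)
```

Any tool that rebuilds a DataFrame from such text therefore has to guess the schema or ask the user for it.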
Upvotes: 1