Reputation: 71
I would like to convert a Pandas DataFrame to a string so that I can use it with regex.
Input Data:
SRAVAN
KUMAR
RAKESH
SOHAN
import re
import pandas as pd
file = spark.read.text("hdfs://test.txt")
pands = file.toPandas()
schema: pyspark.sql.dataframe.DataFrame
result = re.sub(r"\n","",pands,0,re.MULTILINE)
print(result)
SRAVANKUMAR
RAKESHSOHAN
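(For reference, the pandas route the snippet above attempts needs the DataFrame joined into a single string first, because re.sub only accepts strings, not DataFrames. A minimal sketch, using the 'value' column name that spark.read.text produces and hypothetical sample data in place of the HDFS file:)

```python
import re

import pandas as pd

# Hypothetical stand-in for file.toPandas(): spark.read.text puts
# each line of the file into a single column named 'value'.
pands = pd.DataFrame({"value": ["SRAVAN", "KUMAR", "RAKESH", "SOHAN"]})

# re.sub works on strings, not DataFrames, so join the rows first.
text = "\n".join(pands["value"])
result = re.sub(r"\n", "", text)
print(result)  # SRAVANKUMARRAKESHSOHAN
```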
Upvotes: 2
Views: 8429
Reputation: 191983
You don't need Pandas for this. Spark has its own regex replace function. This will replace \n in every row with an empty string. By default, spark.read.text will read each line of the file into one dataframe row, so you cannot have a multi-line string value anyway...
from pyspark.sql.functions import col, regexp_replace
df = spark.read.text("hdfs://test.txt")
df = df.select(regexp_replace(col('value'), '\n', '').alias('value'))
df.show()
To get the dataframe into a single joined string, collect the dataframe; since collect() pulls every row back to the driver, this should be avoided for large datasets.
s = '\n'.join(d['value'] for d in df.collect())
Upvotes: 2