kmr

Reputation: 71

Convert Data Frame to string in pyspark

I would like to convert a Pandas DataFrame to a string so that I can use it in a regex.

Input Data:

SRAVAN
KUMAR
RAKESH
SOHAN

import re
import pandas as pd

file = spark.read.text("hdfs://test.txt")
pands = file.toPandas()
# schema: pyspark.sql.dataframe.DataFrame

result = re.sub(r"\n", "", pands, 0, re.MULTILINE)
print(result)

Expected output:

SRAVANKUMAR
RAKESHSOHAN

Upvotes: 2

Views: 8429

Answers (1)

OneCricketeer

Reputation: 191983

You don't need Pandas for this. Spark has its own regex replace function.

This will replace \n in every row with an empty string.

By default, spark.read.text reads each line of the file into its own dataframe row, so no individual value can contain a newline anyway.

from pyspark.sql.functions import col, regexp_replace

df = spark.read.text("hdfs://test.txt")
# Replace any newline inside each row's 'value' column with an empty string;
# the alias keeps the column name 'value' so it can be referenced below.
df = df.select(regexp_replace(col('value'), '\n', '').alias('value'))
df.show()
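For illustration only (not part of the original answer), the same transform can be tried on an in-memory DataFrame built from the question's sample lines; spark is assumed to be an existing SparkSession:

from pyspark.sql.functions import col, regexp_replace

# Mimic what spark.read.text would produce for the sample file:
# one row per line, in a single column named 'value'.
sample = spark.createDataFrame(
    [("SRAVAN",), ("KUMAR",), ("RAKESH",), ("SOHAN",)], ["value"]
)

# The replace is effectively a no-op here, because no single value
# contains a newline; '\n' only separates lines in the source file.
sample.select(regexp_replace(col("value"), "\n", "").alias("value")).show()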

To get the dataframe into a single joined string, collect the dataframe. This should be avoided for large datasets, though, since it pulls every row to the driver.

s = '\n'.join(d['value'] for d in df.collect())
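Since the stated goal in the question was to run a Python regex over the file contents, the collected string can then be handed to re directly. The pattern below is only a guess at the intended pairing of lines (SRAVAN+KUMAR, RAKESH+SOHAN), not something from the original answer:

import re

# 's' is the newline-joined string built on the line above from df.collect().
# Join every pair of consecutive lines, which reproduces the expected
# output from the question:
#   SRAVANKUMAR
#   RAKESHSOHAN
result = re.sub(r"(\S+)\n(\S+)", r"\1\2", s)
print(result)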

Upvotes: 2
