sa_n

Reputation: 23

How to remove duplicate numbers from a text file using PySpark (Python)

I'm trying to remove duplicate numbers from a text file using PySpark (Python), but the deduplication is applied within each row instead of across the whole file. E.g. my text file is:

3  
66  
4  
9  
3  
23 

Below is the code that I tried:

import pyspark
from pyspark import SparkContext, SparkConf
from collections import OrderedDict
sc = SparkContext.getOrCreate()
data = sc.textFile('file.txt')
new_data = data.map(lambda x: list(OrderedDict.fromkeys(x)))
new_data.collect()

I get the output as: [['3'], ['6'], ['4'], ['9'], ['3'], ['2', '3']]

But I want: [3, 66, 4, 9, 23]

Upvotes: 0

Views: 508

Answers (2)

Bhanurdra

Reputation: 1

I assume you are reading a text file containing a single column of numbers, as you have shown. Here are a few possible solutions.

1. Removing duplicates

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# Each line of the file becomes one row in a single string column named 'value'
df = spark.read.text("file.txt").drop_duplicates()
df.show()
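If you want the result back as a Python list of integers (matching the desired [3, 66, 4, 9, 23]), here is a minimal sketch, assuming every line parses as an integer:

from pyspark.sql.functions import col

# Cast the string 'value' column to int, then collect to the driver
nums = [row[0] for row in df.select(col("value").cast("int")).collect()]
print(nums)  # e.g. [3, 66, 4, 9, 23] (row order is not guaranteed)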

2. If you would like to target a specific character position within each line of the text file, create a new column for that slice and apply the same process.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.getOrCreate()

# 'starting_position' and 'length' are placeholders for the slice you want
df = spark.read.text("file.txt")
df = df.withColumn("col1", col('value').substr(starting_position, length))
df.select("col1").drop_duplicates().show()
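For example, assuming (hypothetically) that the number of interest occupies the first two characters of each line:

# substr is 1-based: take two characters starting at position 1
df.withColumn("col1", col("value").substr(1, 2)).select("col1").drop_duplicates().show()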

Upvotes: 0

OneCricketeer

Reputation: 191831

You're mapping OrderedDict.fromkeys over each line, which treats the line as a sequence of characters, so you get back an RDD whose entries are lists of de-duplicated characters.

To get the unique rows of a DataFrame, use distinct():

from pyspark.sql import SparkSession

spark = SparkSession.builder\
       .master("local")\
       .appName("Unique Example")\
       .getOrCreate()

# Each line becomes a row in a single 'value' column;
# distinct() removes duplicate rows
df = spark.read.text("file.txt")
df.distinct().show()

Note that this uses the Spark SQL DataFrame API, which is the preferred mode of operation for most tasks, compared to your code, which uses RDDs. RDDs also have a distinct function.
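For completeness, a minimal RDD-based sketch, assuming every line holds a single integer, that produces the desired [3, 66, 4, 9, 23]:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Deduplicate whole lines rather than the characters within each line
nums = sc.textFile("file.txt").map(lambda x: int(x.strip())).distinct()
print(nums.collect())  # e.g. [3, 66, 4, 9, 23] (order is not guaranteed)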

Upvotes: 1
