vikrant rana

Reputation: 4676

Skip lines from a CSV file if they contain a specific keyword in PySpark

I have a CSV file with the details as shown below:

emp_id,emp_name,emp_city,emp_salary
1,VIKRANT SINGH RANA    ,NOIDA   ,10000
3,GOVIND NIMBHAL        ,DWARKA  ,92000
2,RAGHVENDRA KUMAR GUPTA,GURGAON ,50000
4,ABHIJAN SINHA         ,SAKET   ,65000
5,SUPER DEVELOPER       ,USA     ,50000
6,RAJAT TYAGI           ,UP      ,65000
7,AJAY SHARMA           ,NOIDA   ,70000
8,SIDDHARTH BASU        ,SAKET   ,72000
9,ROBERT                ,GURGAON ,70000
9,ABC                   ,ROBERT  ,10000
9,XYZ                   ,ROBERTGURGAON,70000

I want to skip every line that contains the keyword "ROBERT" anywhere in it. The expected output is:

+------+--------------------+-------------+----------+
|emp_id|            emp_name|     emp_city|emp_salary|
+------+--------------------+-------------+----------+
|     1|VIKRANT SINGH RAN...|     NOIDA   |     10000|
|     3|GOVIND NIMBHAL   ...|     DWARKA  |     92000|
|     2|RAGHVENDRA KUMAR ...|     GURGAON |     50000|
|     4|ABHIJAN SINHA    ...|     SAKET   |     65000|
|     5|SUPER DEVELOPER  ...|     USA     |     50000|
|     6|RAJAT TYAGI      ...|     UP      |     65000|
|     7|AJAY SHARMA      ...|     NOIDA   |     70000|
|     8|SIDDHARTH BASU   ...|     SAKET   |     72000|
+------+--------------------+-------------+----------+

I can load this file into a DataFrame and filter it with the expression below, but I would have to repeat it for each column:

newdf = emp_df.where(~ col("emp_city").like("ROBERT%"))

I am looking for a solution that filters the lines before they are loaded into a DataFrame, so that I do not need to traverse all the columns to find the specific string.

Upvotes: 0

Views: 119

Answers (1)

vikrant rana

Reputation: 4676

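I was able to filter it using an RDD: read the file as plain text, drop the header, and remove every line that contains the keyword before splitting the lines into columns.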

textdata = sc.textFile(PATH_TO_FILE)
header = textdata.first()
# Drop the header line, then drop every line that contains the keyword.
textnewdata = textdata.filter(lambda x: x != header)
newRDD = textnewdata.filter(lambda row: 'ROBERT' not in row)

[u'1,VIKRANT SINGH RANA    ,NOIDA   ,10000', 
u'3,GOVIND NIMBHAL        ,DWARKA  ,92000', 
u'2,RAGHVENDRA KUMAR GUPTA,GURGAON ,50000', 
u'4,ABHIJAN SINHA         ,SAKET   ,65000', 
u'5,SUPER DEVELOPER       ,USA     ,50000', 
u'6,RAJAT TYAGI           ,UP      ,65000', 
u'7,AJAY SHARMA           ,NOIDA   ,70000', 
u'8,SIDDHARTH BASU        ,SAKET   ,72000']

newsplitRDD = newRDD.map(lambda l: l.split(","))

newDF = newsplitRDD.toDF()

>>> newDF.show()
+---+--------------------+--------+-----+
| _1|                  _2|      _3|   _4|
+---+--------------------+--------+-----+
|  1|VIKRANT SINGH RAN...|NOIDA   |10000|
|  3|GOVIND NIMBHAL   ...|DWARKA  |92000|
|  2|RAGHVENDRA KUMAR ...|GURGAON |50000|
|  4|ABHIJAN SINHA    ...|SAKET   |65000|
|  5|SUPER DEVELOPER  ...|USA     |50000|
|  6|RAJAT TYAGI      ...|UP      |65000|
|  7|AJAY SHARMA      ...|NOIDA   |70000|
|  8|SIDDHARTH BASU   ...|SAKET   |72000|
+---+--------------------+--------+-----+
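
The result above loses the column names (`_1`, `_2`, ...) because the header was dropped before `toDF()`. A minimal sketch of one way to keep them, assuming Spark 2.2 or later (where `spark.read.csv` also accepts an RDD of strings) and an existing `SparkSession` named `spark`; the helper names `keep_line` and `load_without_keyword` are made up for illustration:

```python
def keep_line(line, keyword, header):
    """Return True for the header line and for any line without the keyword."""
    return line == header or keyword not in line


def load_without_keyword(spark, path, keyword):
    """Filter the raw text lines first, then let the CSV reader parse the rest.

    `spark` is assumed to be an existing SparkSession (Spark 2.2+).
    """
    lines = spark.sparkContext.textFile(path)
    header = lines.first()
    # Same substring test as in the answer, but the header line is kept.
    kept = lines.filter(lambda line: keep_line(line, keyword, header))
    # header=True takes the column names from the first line of the RDD.
    return spark.read.csv(kept, header=True)
```

It would be called as `newDF = load_without_keyword(spark, PATH_TO_FILE, "ROBERT")`. The keyword test is still a plain substring match on the whole line, so it removes a row when "ROBERT" appears in any column, including inside a longer value such as `ROBERTGURGAON`.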

Upvotes: 2
