Ajay Santhosh

Reputation: 19

How to select a particular column from a CSV in pyspark?

I read a CSV file in PySpark:

inputRDD1 = sc.textFile('a.csv')

Data:

a b
1 1
2 3

I want to select the column 'b' so that I can do manipulations on it, such as computing the mean. But how do I select the column? I checked many tutorials but couldn't find one that covers this.

Please let me know,

Thank you.

I am trying to find the unique elements of a column.

I tried this:

newrdd = inputRDD1.map(lambda x: x[2])

but I am not able to select the column 'b'.

Upvotes: 0

Views: 16186

Answers (2)

Alex

Reputation: 21766

Please see raj's answer as it is more complete. I provided my solution as it might be easier to understand for a beginner.

sc.textFile reads each line as a single string, so there is only one "column" in your inputRDD. You will need to split each input line on your delimiter character first (either a space or a tab). Once you have done that, you can select the column you are after:

inputRDD1 = sc.parallelize(['a b','1 1','2 3'])
newrdd = inputRDD1.map(lambda x: x.split()[1])
newrdd.collect()

gives

['b', '1', '3']

Upvotes: 3

Rajnish Kumar

Reputation: 2938

Hi, to select a particular column from an RDD in Python, do it like below.

Sample data (tab-separated):

(screenshot of the sample data, omitted here)

from pyspark.conf import SparkConf
from pyspark.context import SparkContext

# create the Spark context
conf = SparkConf().setAppName("SelectingColumn").setMaster("local[*]")
sc = SparkContext(conf=conf)

# read the data (one partition)
raw_data = sc.textFile("C:\\Users...\\SampleCsv.txt", 1)

# custom method to return column b data only
def parse_data(line):
    fields = line.split("\t")
    # use index 0 for column 1, 1 for column 2, and so on
    return fields[1]

columnBdata = raw_data.map(parse_data)
print(columnBdata.take(4)) # yields column b data only

Output: ['b', '2', '7', '12']

Upvotes: 2
