Ajay Santhosh

Reputation: 19

How to select a particular column from a CSV in pyspark?

I read a CSV file in PySpark:

inputRDD1 = sc.textFile('a.csv')

Data:

a b
1 1
2 3

I want to select the column 'b' so that I can do manipulations on it, such as computing the mean. But how do I select the column? I checked many tutorials but couldn't find one that covers this.

Please let me know,

Thank you.

I am trying to find the unique elements of a column.

I tried this:

newrdd = inputRDD1.map(lambda x: x[2])

but I am not able to select the column 'b'.

Upvotes: 0

Views: 16186

Answers (2)

Alex

Reputation: 21766

Please see raj's answer as it is more complete. I provided my solution as it might be easier to understand for a beginner.

sc.textFile reads each line as a single string, so there is only one "column" in your inputRDD. You will need to split each input line on your delimiter character first (either a space or a tab). Once you have done that, you can select the column you are after:

inputRDD1 = sc.parallelize(['a b','1 1','2 3'])
newrdd = inputRDD1.map(lambda x: x.split()[1])
newrdd.collect()

gives

['b', '1', '3']

Upvotes: 3

Rajnish Kumar

Reputation: 2938

Hi, to select a particular column from an RDD in Python, do it like below.

Sample data (tab-separated):

(screenshot of the sample data, omitted here)

from pyspark.conf import SparkConf
from pyspark.context import SparkContext

# create the Spark context
conf = SparkConf().setAppName("SelectingColumn").setMaster("local[*]")
sc = SparkContext(conf=conf)

# read the data (one partition)
raw_data = sc.textFile("C:\\Users...\\SampleCsv.txt", 1)

# custom method to return column b data only
def parse_data(line):
    fields = line.split("\t")
    # use index 0 for column 1, 1 for column 2, and so on
    return fields[1]

columnBdata = raw_data.map(parse_data)
print(columnBdata.take(4)) # yields column b data only

Output: ['b', '2', '7', '12']

Upvotes: 2
