xn139

Reputation: 395

How do I get certain columns from a dataset in Apache Spark (pyspark)?

I have a dataset with four columns of data.

Ex:

a  b  c  d
1, 2, 3, 4

...

Using pyspark, how can I retrieve only the data for columns a and b? I'm new to Spark and have tried many things, including:

dataset = data_raw.filter(lambda line: line != dataset_header) \
    .map(lambda line: line.split(", ", maxsplit=2)).take(1)

But this doesn't seem to achieve the objective. All I want is to keep columns a and b and discard the rest of the dataset. Any help would be much appreciated.

Upvotes: 2

Views: 3989

Answers (2)

Grr

Reputation: 16079

I am not sure that code would have done what you expected even if it had worked. See the documentation for split for clarity. Here is a simple example:

my_string = '1, 2, 3, 4'
result = my_string.split(', ', maxsplit=2)

print(result)
# ['1', '2', '3, 4']

As you can see, you end up with three elements, because maxsplit=2 splits on only the first two instances of ', '.

That little detail aside, have you tried:

dataset = data_raw.filter(lambda line: line != dataset_header) \
    .map(lambda line: line.split(', ')[:2])

EDIT

In response to your comment, I just loaded a Spark RDD with your example data and tested it. Below is an image of the result.

[image: pyspark map example]
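
For reference, here is a minimal, self-contained version of that test (sc.parallelize stands in for however you actually load the file, and the header string is my assumption about your data; the sample row is the one from your question):

# Build a tiny RDD with a header line and one data line.
dataset_header = 'a, b, c, d'
data_raw = sc.parallelize([dataset_header, '1, 2, 3, 4'])

# Drop the header line, then split each row and keep the first two columns.
dataset = data_raw.filter(lambda line: line != dataset_header) \
    .map(lambda line: line.split(', ')[:2])

print(dataset.collect())
# [['1', '2']]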

EDIT2

Since you noted that your data is in a CSV file, you can just use SparkSession.read.csv. Once you have the DataFrame, you can select your columns:

df['a', 'b'].show(5)

This would show the first five rows.
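
For completeness, a minimal sketch of that flow (the file name data.csv and the header option are assumptions about your setup):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumes the first row of the CSV is a header naming the columns a, b, c, d.
df = spark.read.csv('data.csv', header=True)

# Keep only columns a and b and show the first five rows.
df['a', 'b'].show(5)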


Upvotes: 1

Amit Darji

Reputation: 457

Have you tried the select method for selecting only the two columns?

dataset.select('a', 'b').show()

I think you should use a CSV reader for your dataset.

# Split each line on commas, drop malformed rows, and keep only the first two columns.
sc.textFile("file.csv") \
    .map(lambda line: line.split(",")) \
    .filter(lambda fields: len(fields) > 1) \
    .map(lambda fields: fields[:2]) \
    .collect()
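
Note that this also keeps the header row. Here is one way to drop it, as a sketch, assuming the first record of the file is the header:

rdd = sc.textFile("file.csv") \
    .map(lambda line: line.split(",")) \
    .filter(lambda fields: len(fields) > 1) \
    .map(lambda fields: fields[:2])

# The first record is the header; filter it out before collecting.
header = rdd.first()
rdd.filter(lambda fields: fields != header).collect()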

Upvotes: 2
