xn139

Reputation: 395

How do I get certain columns from a dataset in Apache Spark (pyspark)?

I have a dataset with four columns of data.

Ex:

a  b  c  d
1, 2, 3, 4

...

Using pyspark, how can I retrieve only the data for columns a and b? I'm new to Spark and have tried many things, including:

dataset = data_raw.filter(lambda line: line != dataset_header) \
    .map(lambda line: line.split(", ", maxsplit=2)).take(1)

But this doesn't seem to achieve the objective. All I want is to keep columns a and b and discard the rest of the dataset. Any help would be much appreciated.

Upvotes: 2

Views: 3989

Answers (2)

Grr

Reputation: 16079

I am not sure that code would have done what you expected even if it had worked. See the documentation for split for clarity. Here is a simple example:

my_string = '1, 2, 3, 4'
result = my_string.split(', ', maxsplit=2)

print(result)
# ['1', '2', '3, 4']

As you can see, you end up with three elements, because maxsplit=2 splits on only the first two instances of ', '.

That little detail aside, have you tried:

dataset = data_raw.filter(lambda line: line != dataset_header) \
    .map(lambda line: line.split(', ')[:2])

EDIT

In response to your comment, I just loaded a Spark RDD with your example data and tested it. Below is an image of the result.

[image: pyspark map example]
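
For reference, here is a minimal, self-contained version of that test (sc.parallelize stands in for however you actually load the file, and the header string is my assumption about your data; the sample row is the one from your question):

# Build a tiny RDD with a header line and one data line.
dataset_header = 'a, b, c, d'
data_raw = sc.parallelize([dataset_header, '1, 2, 3, 4'])

# Drop the header line, then split each row and keep the first two columns.
dataset = data_raw.filter(lambda line: line != dataset_header) \
    .map(lambda line: line.split(', ')[:2])

print(dataset.collect())
# [['1', '2']]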

EDIT2

Since you noted that your data is in a CSV file, you can just use SparkSession.read.csv. Once you have the DataFrame, you can select your columns:

df['a', 'b'].show(5)

This would show the first five rows.
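
For completeness, a minimal sketch of that flow (the file name data.csv and the header option are assumptions about your setup):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumes the first row of the CSV is a header naming the columns a, b, c, d.
df = spark.read.csv('data.csv', header=True)

# Keep only columns a and b and show the first five rows.
df['a', 'b'].show(5)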


Upvotes: 1

Amit Darji

Reputation: 457

Have you tried the select method for selecting only the two columns?

dataset.select('a', 'b').show()

I think you should use a CSV reader for your dataset.

# Split each line on commas, drop malformed rows, and keep only the first two columns.
sc.textFile("file.csv") \
    .map(lambda line: line.split(",")) \
    .filter(lambda fields: len(fields) > 1) \
    .map(lambda fields: fields[:2]) \
    .collect()
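
Note that this also keeps the header row. Here is one way to drop it, as a sketch, assuming the first record of the file is the header:

rdd = sc.textFile("file.csv") \
    .map(lambda line: line.split(",")) \
    .filter(lambda fields: len(fields) > 1) \
    .map(lambda fields: fields[:2])

# The first record is the header; filter it out before collecting.
header = rdd.first()
rdd.filter(lambda fields: fields != header).collect()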

Upvotes: 2
