Reputation: 395
I have a dataset with four columns of data.
Ex:
a b c d
1, 2, 3, 4
...
Using PySpark, how can I retrieve the data for columns a and b only? I'm new to Spark and have tried many things, including:
dataset = data_raw.filter(lambda line: line != dataset_header) \
.map(lambda line: line.split(", ", maxsplit=2)).take(1)
But this doesn't seem to achieve what I need. All I want is to keep columns a and b and discard the rest of the dataset. Any help would be much appreciated.
Upvotes: 2
Views: 3989
Reputation: 16079
I am not sure that code would do what you expect even if it had worked. See the documentation for split for a little clarity. Here is a simple example:
my_string = '1, 2, 3, 4'
result = my_string.split(', ', maxsplit=2)
print(result)
['1', '2', '3, 4']
As you can see, you end up with three elements because you split on only the first two instances of ', '.
That little detail aside, have you tried:
dataset = data_raw.filter(lambda line: line != dataset_header) \
.map(lambda line: line.split(', ')[:2])
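For your sample data, that leaves just the first two fields per row; a quick sanity check (reusing the dataset variable from above, with the output shown beneath the call):
dataset.take(1)
[['1', '2']]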
EDIT
In response to your comment, I loaded a Spark RDD with your example data and tested it; the output matches the two-field rows shown above.
EDIT2
Seeing as you noted that your data is in a CSV file, you can just use SparkSession.read.csv. Once you have the DataFrame, you can simply select your columns:
df['a', 'b'].show(5)
Would show the first five rows.
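For completeness, here is a minimal end-to-end sketch of that approach; the file name data.csv and the app name are placeholders, and it assumes the file has a header row naming the columns:
from pyspark.sql import SparkSession
# Build (or reuse) a session; the app name here is arbitrary
spark = SparkSession.builder.appName("select-ab").getOrCreate()
# header=True uses the first row as column names; inferSchema is optional
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.select("a", "b").show(5)  # keep only columns a and b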
Upvotes: 1
Reputation: 457
Have you tried the select method for selecting only the two columns?
dataset.select('a','b').show()
I think you should use a CSV reader for your dataset:
# split each line on commas, drop malformed rows, and keep only the first two fields
sc.textFile("file.csv") \
    .map(lambda line: line.split(",")) \
    .filter(lambda fields: len(fields) > 1) \
    .map(lambda fields: fields[:2]) \
    .collect()
Upvotes: 2