Minu

Reputation: 7

Read a csv file from s3 excluding some values

How can I read a CSV file from S3 while excluding a few values?

For example, given the list [a, b], I need to read every value in the CSV except a and b. I know how to read the whole CSV from S3 with sqlContext.read.csv(s3_path, header=True), but how do I exclude these two values and read the rest of the file?

Upvotes: 0

Views: 418

Answers (2)

John Rotenstein

Reputation: 269101

If you only want a few rows, you could use S3 Select (see the AWS News Blog post "S3 Select and Glacier Select – Retrieving Subsets of Objects"). It lets you run SQL against an S3 object without downloading the whole object.
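
As a rough sketch of the S3 Select route, the snippet below builds a SQL expression that skips rows whose column matches any excluded value, and passes it to boto3's `select_object_content`. The helper names, the column name `name`, and the bucket and key arguments are hypothetical; the sketch also assumes the CSV has a header row and that S3 Select is enabled for the account.

```python
def build_exclusion_query(column, excluded):
    """Build an S3 Select expression that keeps rows whose `column`
    differs from every value in `excluded` (hypothetical helper)."""
    if not excluded:
        return "SELECT * FROM s3object s"

    def quote(value):
        # Escape single quotes by doubling them, as in standard SQL.
        return "'" + str(value).replace("'", "''") + "'"

    # Assumes `column` itself contains no double quotes.
    conditions = " AND ".join(f's."{column}" <> {quote(v)}' for v in excluded)
    return f"SELECT * FROM s3object s WHERE {conditions}"


def select_rows(bucket, key, column, excluded):
    """Run the query against one S3 object and return the matching CSV text.
    Needs boto3 and AWS credentials, so it is not called in this sketch."""
    import boto3  # imported here so the query builder works without boto3

    client = boto3.client("s3")
    response = client.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=build_exclusion_query(column, excluded),
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
        OutputSerialization={"CSV": {}},
    )
    # The payload is an event stream; only "Records" events carry data.
    chunks = [e["Records"]["Payload"] for e in response["Payload"] if "Records" in e]
    return b"".join(chunks).decode("utf-8")


query = build_exclusion_query("name", ["a", "b"])
print(query)
```

Note that the filtering happens on the S3 side, so only the remaining rows are transferred; the result comes back as raw CSV text that still has to be parsed.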

Alternatively, you could use Amazon Athena to query a CSV file using SQL.

However, it might simply be easier to download the whole file and do the processing locally in your Python app.

Upvotes: 0

Prune

Reputation: 77827

You don't. A file is a sequential storage medium. A CSV file is a form of text file: it's character-indexed. Therefore, to exclude columns, you have to first read and process the characters to find the column boundaries.

Even if you could magically find those boundaries, you would have to seek past those locations; this would likely cost you more time than simply reading and ignoring the characters, since you would be interrupting the usual, smooth block-transfer instructions that drive most file buffering.

As the comments tell you, simply read the file as is and discard the unwanted data as part of your data cleansing. If you need the file repeatedly, then cleanse it once, and use that version for your program.
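
A minimal sketch of the "read everything, then discard" approach, using only the standard library on an in-memory sample. The function name, the column name `name`, and the sample data are all made up for illustration; in practice the input would be the file downloaded from S3.

```python
import csv
import io


def read_csv_excluding(lines, column, excluded):
    """Yield each row (as a dict) whose value in `column` is not in `excluded`."""
    excluded = set(excluded)  # set membership keeps the per-row check cheap
    for row in csv.DictReader(lines):
        if row[column] not in excluded:
            yield row


# Stand-in for the downloaded file; the real data would come from S3.
sample = io.StringIO("id,name\n1,a\n2,b\n3,c\n4,d\n")
kept = list(read_csv_excluding(sample, "name", ["a", "b"]))
print([row["name"] for row in kept])  # ['c', 'd']
```

In PySpark, the same idea would typically be a filter on the DataFrame returned by `sqlContext.read.csv`, for example `df.filter(~df["name"].isin(["a", "b"]))`, where `name` is again a hypothetical column. Writing the filtered result back out once covers the "cleanse it once" suggestion.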

Upvotes: 1
