Reputation: 7
How can I read a CSV file from S3 while excluding a few values?
For example, given the list [a, b]:
I need to read every value in the CSV except a and b. I know how to read the whole CSV from S3: sqlContext.read.csv(s3_path, header=True)
But how do I exclude these two values and read the rest of the file?
Upvotes: 0
Views: 418
Reputation: 269101
If you only want a few rows, you could use S3 Select (described in the AWS News Blog post "S3 Select and Glacier Select – Retrieving Subsets of Objects"). It lets you run SQL against an S3 object without downloading the whole object.
Alternatively, you could use Amazon Athena to query a CSV file using SQL.
However, it might simply be easier to download the whole file and do the processing locally in your Python app.
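A minimal sketch of that local-processing route, assuming the unwanted values sit in a single column (a hypothetical column named `key`) and that the object body has already been downloaded as text. The function name `rows_excluding` and the sample data are illustrative only.

```python
import csv
import io


def rows_excluding(text_stream, column, excluded):
    """Yield CSV rows (as dicts) whose `column` value is not in `excluded`."""
    excluded = set(excluded)
    reader = csv.DictReader(text_stream)
    for row in reader:
        if row[column] not in excluded:
            yield row


# Hypothetical sample standing in for the downloaded S3 object body.
sample = io.StringIO("key,amount\na,1\nc,2\nb,3\nd,4\n")
kept = list(rows_excluding(sample, column="key", excluded=["a", "b"]))
print(kept)
```

If the data is already in a Spark DataFrame, as in the question, the same filter can be written as `df.filter(~df["key"].isin(["a", "b"]))`, again assuming that column name.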
Upvotes: 0
Reputation: 77827
You don't. A file is a sequential storage medium. A CSV file is a form of text file: it's character-indexed. Therefore, to exclude columns, you have to first read and process the characters to find the column boundaries.
Even if you could magically find those boundaries, you would have to seek past those locations; this would likely cost you more time than simply reading and ignoring the characters, since you would be interrupting the usual, smooth block-transfer instructions that drive most file buffering.
As the comments tell you, simply read the file as is and discard the unwanted data as part of your data cleansing. If you need the file repeatedly, then cleanse it once, and use that version for your program.
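As a rough illustration of "cleanse it once", the sketch below reads every row and writes a copy without the unwanted columns. It assumes that a and b are column names in the header; `drop_columns` and the in-memory sample are hypothetical stand-ins for your real file handling.

```python
import csv
import io


def drop_columns(src, dst, unwanted):
    """Copy CSV data from src to dst, omitting the columns named in `unwanted`."""
    unwanted = set(unwanted)
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader)
    # Indices of the columns to keep, found by reading the header row.
    keep = [i for i, name in enumerate(header) if name not in unwanted]
    writer.writerow([header[i] for i in keep])
    for row in reader:
        # Every row is still read in full; unwanted fields are just not copied.
        writer.writerow([row[i] for i in keep])


# Hypothetical sample data; in practice src would be the downloaded file.
raw = io.StringIO("a,b,c,d\n1,2,3,4\n5,6,7,8\n")
cleansed = io.StringIO()
drop_columns(raw, cleansed, unwanted=["a", "b"])
print(cleansed.getvalue())
```

With a Spark DataFrame the equivalent step is `df.drop("a", "b")` after reading the whole file, and the result can be written back out for reuse.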
Upvotes: 1