Ryan
Ryan

Reputation: 49

unable to grab particular columns from a CSV file

I am trying to access contents of a CSV file and parse it. I just need two columns out of entire CSV file . I can access the CSV and its contents but I need to limit it to the columns I need so that I can use the details from that columns

import os
import boto3
import pandas as pd
import sys
from io import StringIO # Python 3.x
session = boto3.session.Session(profile_name="rli-prod",region_name="us-east-1")
client = session.client("s3")
bucket_name = 'bucketname'
object_key = 'XX/YY/ZZ.csv'
csv_obj = client.get_object(Bucket=bucket_name, Key=object_key)
body = csv_obj['Body']
csv_string = body.read().decode('utf-8-sig')
df = pd.read_csv(StringIO(csv_string))
print(df)

Right now, I am getting the entire CSV. Below is the output

0  63a2a854-a136-4bb1-a89b-a4e638b2be14  8128639b-a163-4e8e-b1f8-22e3dcd2b655  ...                123  63a2a854-a136-4bb1-a89b-a4e638b2be14
1  63a2a854-a136-4bb1-a89b-a4e638b2be14  8d6bdc73-f908-45d8-8d8a-c3ac0bee3b29  ...                123  63a2a854-a136-4bb1-a89b-a4e638b2be14
2  63a2a854-a136-4bb1-a89b-a4e638b2be14  1312e6f6-4c5f-4fa5-babd-93a3c0d3b502  ...                234  63a2a854-a136-4bb1-a89b-a4e638b2be14
3  63a2a854-a136-4bb1-a89b-a4e638b2be14  bfec5ccc-4449-401d-9898-9c523b1e1230  ...                456  63a2a854-a136-4bb1-a89b-a4e638b2be14
4  63a2a854-a136-4bb1-a89b-a4e638b2be14  522a72f0-2746-417c-9a59-fae4fb1e07d7  ...                567  63a2a854-a136-4bb1-a89b-a4e638b2be14

[5 rows x 9 columns]

Right now, My CSV doesnot have any headers , so only option I have is to grab using column number. But am not sure how to do that? Can anyone please assist?

Upvotes: 0

Views: 75

Answers (4)

Joe
Joe

Reputation: 889

Option 1:

If you already read the csv and want to do the dropping of other columns mid calculation. Use the index of which columns you want to use inside df.iloc.

Example:

>>> df                         #sample dataframe I want to get the first 2 columns only
        Artist  Count  Test
0  The Beatles      4     1
1  Some Artist      2     1
2  Some Artist      2     1
3  The Beatles      4     1
4  The Beatles      4     1
5  The Beatles      4     1
>>> df3 = df.iloc[:,[0,1]]
>>> df3
        Artist  Count
0  The Beatles      4
1  Some Artist      2
2  Some Artist      2
3  The Beatles      4
4  The Beatles      4
5  The Beatles      4

Option 2

During the reading of the file itself, specify which columns to use under the parameter usecols of read_csv().

df = pd.read_csv(StringIO(csv_string), usecols = [place column index here])

Upvotes: 2

Raj Josyula
Raj Josyula

Reputation: 150

In [15]: import pandas as pd

In [16]: d1 = {"col1" : "value11", "col2": "value21", "col3": "value31"}

In [17]: d2 = {"col1" : "value12", "col2": "value22", "col3": "value32"}

In [18]: d3 = {"col1" : "value13", "col2": "value23", "col3": "value33"}

In [19]: df = df.append(d1, ignore_index=True, verify_integrity=True, sort=False)

In [20]: df = df.append(d2, ignore_index=True, verify_integrity=True, sort=False)

In [21]: df = df.append(d3, ignore_index=True, verify_integrity=True, sort=False)

In [22]: df
Out[22]:
      col1     col2     col3
0  value11  value21  value31
1  value12  value22  value32
2  value13  value23  value33
3  value11  value21  value31
4  value12  value22  value32
5  value13  value23  value33

In [23]: # Selecting only col1 and col3

In [24]: df_new = df[["col1", "col3"]]

In [25]: df_new
Out[25]:
      col1     col3
0  value11  value31
1  value12  value32
2  value13  value33
3  value11  value31
4  value12  value32
5  value13  value33

In [26]:

Upvotes: 0

Alvaro Cuervo
Alvaro Cuervo

Reputation: 104

strong textUse read_csv method from pandas library:

import pandas as pd

data = pd.read_csv('file.csv', usecols=[2, 4])   
print(data.head())

The parameter usecols accepts the name of the column or index as a list

Upvotes: 1

mgrollins
mgrollins

Reputation: 651

Since you are already utilizing the Pandas library, you should be able to accomplish this by passing the header= argument to the read_csv method like so:

# will pull columns indexed [0,2,4]
df = pd.read_csv(StringIO(csv_string), header=[0,2,4]) 

From the docs: ... The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped)...

Upvotes: 0

Related Questions