user1361529

Reputation: 2697

Data slicing a pandas frame - I'm having problems with unique

I am facing issues trying to select a subset of columns and running unique on it.

Source Data:

import pandas as pd
df_raw = pd.read_csv('data/master.csv', nrows=10000)
df_raw.shape  # shape is an attribute, not a method

Produces:

(10000, 86)

Process Data:

df = df_raw[['A','B','C']]
df.shape

Produces:

(10000, 3)

Furthermore, doing:

df_raw.head()
df.head()

produces a correct list of rows and columns.

However,

print('RAW:',sorted(df_raw['A'].unique()))

works perfectly

Whilst:

print('PROCESSED:',sorted(df['A'].unique()))

produces:

AttributeError: 'DataFrame' object has no attribute 'unique'

What am I doing wrong? If the shape and head output are exactly what I want, I'm confused about why my processed dataset is throwing errors. I did read "Pandas 'DataFrame' object has no attribute 'unique'" on SO, which correctly states that unique needs to be applied to a column, and that is what I am doing.

Upvotes: 1

Views: 6182

Answers (2)

Manjula Devi

Reputation: 149

To extract a subset of the data from the entire dataframe based on a column ID, this works:

df = df.drop_duplicates(subset=['Id'])  # where 'Id' is the column used to identify duplicate rows
print(df)
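
For comparison, here is a minimal sketch of how drop_duplicates differs from unique. The sample frame and its 'value' column are made up for illustration; only the 'Id' column name comes from the snippet above.

```python
import pandas as pd

# Hypothetical sample data, not taken from the question
df = pd.DataFrame({"Id": [1, 1, 2, 3, 3],
                   "value": ["a", "b", "c", "d", "e"]})

# drop_duplicates keeps whole rows: the first row seen for each Id
deduped = df.drop_duplicates(subset=["Id"])

# unique works on a single column (a Series) and returns only its distinct values
unique_ids = df["Id"].unique()

print(deduped)
print(unique_ids)
```

Note that drop_duplicates returns a DataFrame with all columns retained, whereas unique returns an array of the distinct values of one column.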

Upvotes: 0

user1361529

Reputation: 2697

This was a case of a duplicate column. Because this is proprietary data, I abstracted the column names as 'A', 'B', 'C' in this question and thereby masked the problem. (The real data set had 86 columns; I had included one of those columns twice in my subset and was trying to call unique on it.)

My problem was this:

df_raw = pd.read_csv('data/master.csv', nrows=10000)
df = df_raw[['A','B','C', 'A']] # <-- I did not realize I had duplicated A later.

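This was causing problems when calling unique on 'A': with the label duplicated, df['A'] returns a DataFrame containing both copies of the column rather than a Series, and a DataFrame has no unique method.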
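
A minimal sketch of how to reproduce, detect, and work around this, assuming a small made-up frame in place of the proprietary CSV (the values below are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the real data
df_raw = pd.DataFrame({"A": [3, 1, 3, 2],
                       "B": [10, 20, 30, 40],
                       "C": ["w", "x", "y", "z"]})

# Selecting 'A' twice yields two columns that share the same label
df = df_raw[["A", "B", "C", "A"]]

# With a duplicated label, df['A'] is a DataFrame, not a Series,
# so df['A'].unique() raises AttributeError
is_frame = isinstance(df["A"], pd.DataFrame)

# Detect which labels appear more than once
dup_labels = df.columns[df.columns.duplicated()].tolist()

# One possible fix: keep only the first occurrence of each label
df_fixed = df.loc[:, ~df.columns.duplicated()]
unique_a = sorted(df_fixed["A"].unique())

print(is_frame, dup_labels, unique_a)
```

Checking df.columns.duplicated() right after building a subset is a cheap way to catch this before it surfaces as a confusing AttributeError further down.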

Upvotes: 2
