Phurich.P
Phurich.P

Reputation: 1416

Create new columns based on distinct values and count them

Sorry if the title is not clear enough. Let me explain what I want to achieve.

I have this Data-Frame, let's call it df.

id | Area
A    one
A    two
A    one
B    one
B    one
C    one
C    two
D    one
D    one
D    two
D    three

I would like to create a new Data-Frame based on the values in the existing Data-Frame. First, I would like to find a total sum of distinct id in df. Ex. id A has 3 entries, B has 2 entries, etc. Then create a new data frame out of it.

For our new Data-Frame, let's call it df_new

id | count 
 A    3
 B    2
 C    2
 D    4

Next, I would like to create a new column based on values in df['Area'], for this example, df['Area'] contains 3 distinct values (one, two, three). I would like to count the number of times an id has been in which Area. For example, id A has been in area one twice, once in area two and zero in area three. Then, I will append those values into a new column called one, two and three.

df_new :

id | count | one | two | three
 A    3       2     1      0
 B    2       2     0      0
 C    2       1     1      0
 D    4       2     1      1

I have developed my own code which produces df_new, however I believe Pandas has a better function to perform this sort of data extraction. Here is my code.

#Read the data
df = pd.read_csv('test_data.csv', sep = ',')
df.columns = ['id', 'Area'] #Rename
# Count a total number of Area by Id
df_new = pd.DataFrame({'count' : df.groupby("id")["Area"].count()})
# Reset index
df_new = df_new.reset_index()
#For loop for counting and creating a new column for areas in df['Area']
for i in xrange(0, len(df)):
    #Get the id
    idx = df['id'][i]
    #Get the areaname
    area_name = str(df["Area"][i])
    #Retrieve the index of a particular id
    current_index = df_new.loc[df_new['id'] == idx, ].index[0]
    #If area name exists in a column
    if area_name in df_new.columns:
        #Then +1 at the Location of the idx (Index)
        df_new[area_name][current_index] += 1
    #If not exists in the columns
    elif area_name not in df_new.columns:
        #Create an empty one with zeros
        df_new[area_name] = 0
        #Then +1 at the location of the idx (Index)
        df_new[area_name][current_index] += 1

The code is long and hard to read. It also suffers from the warning "A value is trying to be set on a copy of a slice from a DataFrame". I would like to learn more on how to write this effectively.

Thank you

Upvotes: 1

Views: 2370

Answers (1)

cs95
cs95

Reputation: 402553

You can use df.groupby.count for the first part and pd.crosstab for the the second. Then, use pd.concat to join em:

In [1246]: pd.concat([df.groupby('id').count().rename(columns={'Area' : 'count'}),\
                      pd.crosstab(df.id, df.Area)], 1)
Out[1246]: 
    count  one  three  two
id                        
A       3    2      0    1
B       2    2      0    0
C       2    1      0    1
D       4    2      1    1

Here's the first part using df.groupby:

df.groupby('id').count().rename(columns={'Area' : 'count'})

    count
id       
A       3
B       2
C       2
D       4 

Here's the second part with pd.crosstab:

pd.crosstab(df.id, df.Area)

Area  one  three  two
id                   
A       2      0    1
B       2      0    0
C       1      0    1
D       2      1    1

For the second part, you can also use pd.get_dummies and do a dot product:

(pd.get_dummies(df.id).T).dot(pd.get_dummies(df.Area))

   one  three  two
A    2      0    1
B    2      0    0
C    1      0    1
D    2      1    1

Upvotes: 1

Related Questions