Reputation: 1416
Sorry if the title is not clear enough. Let me explain what I want to achieve.
I have this Data-Frame, let's call it df.
id | Area
A one
A two
A one
B one
B one
C one
C two
D one
D one
D two
D three
I would like to create a new Data-Frame based on the values in the existing Data-Frame. First, I would like to find a total sum of distinct id in df. Ex. id A has 3 entries, B has 2 entries, etc. Then create a new data frame out of it.
For our new Data-Frame, let's call it df_new
id | count
A 3
B 2
C 2
D 4
Next, I would like to create a new column based on values in df['Area'], for this example, df['Area'] contains 3 distinct values (one, two, three). I would like to count the number of times an id has been in which Area. For example, id A has been in area one twice, once in area two and zero in area three. Then, I will append those values into a new column called one, two and three.
df_new :
id | count | one | two | three
A 3 2 1 0
B 2 2 0 0
C 2 1 1 0
D 4 2 1 1
I have developed my own code which produces df_new, however I believe Pandas has a better function to perform this sort of data extraction. Here is my code.
#Read the data
df = pd.read_csv('test_data.csv', sep = ',')
df.columns = ['id', 'Area'] #Rename
# Count a total number of Area by Id
df_new = pd.DataFrame({'count' : df.groupby("id")["Area"].count()})
# Reset index
df_new = df_new.reset_index()
#For loop for counting and creating a new column for areas in df['Area']
for i in xrange(0, len(df)):
#Get the id
idx = df['id'][i]
#Get the areaname
area_name = str(df["Area"][i])
#Retrieve the index of a particular id
current_index = df_new.loc[df_new['id'] == idx, ].index[0]
#If area name exists in a column
if area_name in df_new.columns:
#Then +1 at the Location of the idx (Index)
df_new[area_name][current_index] += 1
#If not exists in the columns
elif area_name not in df_new.columns:
#Create an empty one with zeros
df_new[area_name] = 0
#Then +1 at the location of the idx (Index)
df_new[area_name][current_index] += 1
The code is long and hard to read. It also suffers from the warning "A value is trying to be set on a copy of a slice from a DataFrame". I would like to learn more on how to write this effectively.
Thank you
Upvotes: 1
Views: 2370
Reputation: 402553
You can use df.groupby.count
for the first part and pd.crosstab
for the the second. Then, use pd.concat
to join em:
In [1246]: pd.concat([df.groupby('id').count().rename(columns={'Area' : 'count'}),\
pd.crosstab(df.id, df.Area)], 1)
Out[1246]:
count one three two
id
A 3 2 0 1
B 2 2 0 0
C 2 1 0 1
D 4 2 1 1
Here's the first part using df.groupby
:
df.groupby('id').count().rename(columns={'Area' : 'count'})
count
id
A 3
B 2
C 2
D 4
Here's the second part with pd.crosstab
:
pd.crosstab(df.id, df.Area)
Area one three two
id
A 2 0 1
B 2 0 0
C 1 0 1
D 2 1 1
For the second part, you can also use pd.get_dummies
and do a dot product:
(pd.get_dummies(df.id).T).dot(pd.get_dummies(df.Area))
one three two
A 2 0 1
B 2 0 0
C 1 0 1
D 2 1 1
Upvotes: 1