Reputation: 1248
I have a scenario where I am trying to filter a dataframe by a particular value, and count how many times another identifier is present. I'm then turning that into a dictionary and mapping back to the dataframe. The issue I am having is that the resulting dictionary cannot be mapped back to the dataframe because I'm introducing complexity to the dictionary (extra keys?), and I don't know how to avoid it.
I guess the simple question is: 'How can I use value_counts on my CELL_ID column', filter by another column called Grid_Type, and map the results back to all cells per CELL_ID?
What I'm doing so far
This works to count how many cells contain the CELL_ID, but does NOT allow me to filter by Grid_Type
df['CELL_ID'].value_counts()
z1 = z.to_dict()
df['CELL_CNT'] = df['CELL_ID'].map(z1)
The dictionary output from this simple example looks like:
7015988: 1, 7122961: 1, 6976792: 1
My bad code
This is what I've been working on so far - where I want to be able to return the count, filtered by the Grid_Type. Eg I want to be able to count the number of times I see "Spot" in/by each CELL_ID.
z = df[df.Grid_Type == 'Spot'].groupby('CELL_ID')['Grid_Type'].value_counts()
z1 = z.to_dict()
df['SPOT_CNT'] = df['CELL_ID'].map(z1)
It seems that in the example where I'm trying to filter that the dictionary is returning a more complex result that includes the Grid_Type. The thing is, I only want the counts mapped against the Cell_ID. Eg dictionary response:
(7133691, 'Spot'): 3, (7133692, 'Spot'): 3, (7133693, 'Spot'): 2
Example Data
+---------+-----------+
| CELL_ID | Grid_Type |
+---------+-----------+
| 001 | Spot |
| 001 | Square |
| 001 | Spot |
| 001 | Square |
| 001 | Square |
| 002 | Spot |
| 002 | Square |
| 002 | Square |
| 003 | Square |
| 003 | Spot |
| 003 | Spot |
| 003 | Spot |
+---------+-----------+
Desired Outcome
+---------+-----------+----------+
| CELL_ID | Grid_Type | SPOT_CNT |
+---------+-----------+----------+
| 001 | Spot | 2 |
| 001 | Square | 2 |
| 001 | Spot | 2 |
| 001 | Square | 2 |
| 001 | Square | 2 |
| 002 | Spot | 1 |
| 002 | Square | 1 |
| 002 | Square | 1 |
| 003 | Square | 3 |
| 003 | Spot | 3 |
| 003 | Spot | 3 |
| 003 | Spot | 3 |
+---------+-----------+----------+
Thanks for any help you might be able to offer/
Upvotes: 1
Views: 661
Reputation: 854
Seems you have an answer, but I would approach this problem with transform()
:
# set it up
df = pd.read_clipboard()
print(df)
CELL_ID Grid_Type
0 1 Spot
1 1 Square
2 1 Spot
3 1 Square
4 1 Square
5 2 Spot
6 2 Square
7 2 Square
8 3 Square
9 3 Spot
10 3 Spot
11 3 Spot
df['SPOT_CNT'] = df.groupby('CELL_ID')['Grid_Type'].transform(lambda x: sum(x == 'Spot'))
print(df)
CELL_ID Grid_Type SPOT_CNT
0 1 Spot 2
1 1 Square 2
2 1 Spot 2
3 1 Square 2
4 1 Square 2
5 2 Spot 1
6 2 Square 1
7 2 Square 1
8 3 Square 3
9 3 Spot 3
10 3 Spot 3
11 3 Spot 3
Inside the lambda
function:
- it returns bool if value(x
) == 'Spot'
- for each group, sum()
adds up the True
bools
Lastly transform
,as per docs, behaves like so:
DataFrame.transform(self, func, axis=0, *args, **kwargs) → 'DataFrame'[source]
"Call func on self producing a DataFrame with transformed values."
"Produced DataFrame will have same axis length as self." <----
...
Hope this is helpful.
Upvotes: 1
Reputation: 3086
df = pd.read_csv('spot.txt', sep=r"[ ]{1,}", engine='python', dtype='object')
print(df)
CELL_ID Grid_Type
0 001 Spot
1 001 Square
2 001 Spot
3 001 Square
4 001 Square
5 002 Spot
6 002 Square
7 002 Square
8 003 Square
9 003 Spot
10 003 Spot
11 003 Spot
df_gb = df['Grid_Type'].groupby([df['CELL_ID']]).value_counts()
print(df_gb)
CELL_ID Grid_Type
001 Square 3
Spot 2
002 Square 2
Spot 1
003 Spot 3
Square 1
Name: Grid_Type, dtype: int64
df_gb_dict = df_gb.to_dict()
count_list = []
for idx, row in df.iterrows():
for k, v in df_gb_dict.items():
if k[0] == row['CELL_ID'] and k[1] == row['Grid_Type'] and row['Grid_Type'] == 'Spot':
count_list.append([k[0], k[1], v])
if k[0] == row['CELL_ID'] and k[1] == row['Grid_Type'] and row['Grid_Type'] == 'Square':
count_list.append([k[0], k[1], df_gb_dict[(row['CELL_ID'], 'Spot')]])
new_df = pd.DataFrame(count_list, columns=['CELL_ID', 'Grid_Type', 'SPOT_CNT'])
new_df.sort_values(by='CELL_ID', inplace=True)
new_df.reset_index(drop=True)
print(new_df)
CELL_ID Grid_Type SPOT_CNT
0 001 Spot 2
1 001 Square 2
2 001 Spot 2
3 001 Square 2
4 001 Square 2
5 002 Spot 1
6 002 Square 1
7 002 Square 1
8 003 Square 3
9 003 Spot 3
10 003 Spot 3
11 003 Spot 3
Upvotes: 1