Reputation: 9
The data types in a dataframe (object)are as follows:
id :int64
id_contains :object
categories :object
category contents :object
dtype: object
The data looks like this currently, organized by id and category contents:
id id_contains categories category contents
---------------------------------------------------------
1 a,b,c cat1 a,b,c
2 d,c,a cat2 c
3 c,b,e cat3 e,f,a
4 d,e,f cat4 a,c
I need to count the number of times cat1 occurs in id_contains and create separate columns for each category with the corresponding count for each id. So the resulting output dataframe should look like this:
id id_contains categories category_contents cat1 cat2 cat3 cat4
---------------------------------------------------------------------------
1 a,b,c cat1 a,b,c 3 1 1 2
2 d,c,a cat2 c 2 1 1 2
3 c,b,e cat3 e,f,a 2 1 0 1
4 d,e,f cat4 a,c 0 0 2 0
That is, for each id, I can see how many elements it contains from category 1, category 2, 3 and so on. I am new to pandas and data frames, and I apologize since I cannot share the actual data. I have tried to describe the original dataset as well as possible with dummy data based on previous suggestions from this community. Looking forward to all suggestions!
Upvotes: 0
Views: 244
Reputation: 9619
import pandas as pd
data = [ { "id": 1, "id_contains": "a,b,c", "categories": "cat1", "category_contents": "a,b,c" }, { "id": 2, "id_contains": "d,c,a", "categories": "cat2", "category_contents": "c" }, { "id": 3, "id_contains": "c,b,e", "categories": "cat3", "category_contents": "e,f,a" }, { "id": 4, "id_contains": "d,e,f", "categories": "cat4", "category_contents": "a,c" } ]
df = pd.DataFrame(data).set_index('id')
df['id_contains'] = df['id_contains'].str.split(',') #create list
df['category_contents'] = df['category_contents'].str.split(',') #create list
df['counts'] = df['id_contains'].apply(lambda x: [len(set(x) & set(i)) for i in df['category_contents'].tolist()]) #count occurrences for each category
df[['cat1','cat2','cat3','cat4']] = pd.DataFrame(df['counts'].tolist(), index= df.index) #turn list into columns
df.drop('counts', axis=1, inplace = True) #drop temporary counts column
result:
| id | id_contains | categories | category_contents | cat1 | cat2 | cat3 | cat4 |
|-----:|:----------------|:-------------|:--------------------|-------:|-------:|-------:|-------:|
| 1 | ['a', 'b', 'c'] | cat1 | ['a', 'b', 'c'] | 3 | 1 | 1 | 2 |
| 2 | ['d', 'c', 'a'] | cat2 | ['c'] | 2 | 1 | 1 | 2 |
| 3 | ['c', 'b', 'e'] | cat3 | ['e', 'f', 'a'] | 2 | 1 | 1 | 1 |
| 4 | ['d', 'e', 'f'] | cat4 | ['a', 'c'] | 0 | 0 | 2 | 0 |
Upvotes: 1