Reputation: 308

Group categories in Pandas

I'm new to pandas and data visualisation. I'm working on an OkCupid dataset and want to manipulate some data. I have a column 'education' with several options:

['graduated from college/university', 'graduated from masters program',
       'working on college/university', 'working on masters program',
       'graduated from two-year college', 'graduated from high school',
       'graduated from ph.d program', 'graduated from law school',
       'working on two-year college', 'dropped out of college/university',
       'working on ph.d program', 'college/university',
       'graduated from space camp', 'dropped out of space camp',
       'graduated from med school', 'working on space camp',
       'working on law school', 'two-year college', 'working on med school',
       'dropped out of two-year college', 'dropped out of masters program',
       'masters program', 'dropped out of ph.d program',
       'dropped out of high school', 'high school', 'working on high school',
       'space camp', 'ph.d program', 'law school', 'dropped out of law school',
       'dropped out of med school', 'med school']

I'd like to group them according to the following dictionary so that I can plot them more conveniently:

education_cats = {
    'High-school student' : ['dropped out of high school', 'working on high school'],
    'Ungraduated' : ['graduated from high school', 'dropped out of college/university', 'dropped out of space camp', 
                     'dropped out of two-year college', 'high school', 'dropped out of law school','dropped out of med school'],
    'Student' : ['working on college/university', 'working on two-year college', 'working on law school', 'working on med school'],
    'Graduated' : ['graduated from college/university', 'graduated from two-year college', 'graduated from law school',  
                   'college/university', 'graduated from space camp', 'working on space camp', 'graduated from med school', 
                   'two-year college', 'dropped out of masters program', 'space camp', 'law school', 'med school'],
    '2nd-degree student' : ['working on masters program'],
    'Master' : ['graduated from masters program', 'masters program', 'dropped out of ph.d program'],
    '3rd-degree student' : ['working on ph.d program'],
    'P.hd' : ['graduated from ph.d program', 'ph.d program']
}

I've tried this way:

import numpy as np

def find_key(value):
    # Return the first category whose list contains the raw value.
    for k in education_cats.keys():
        if value in education_cats[k]:
            return k
    return np.nan

df['education_category'] = df['education'].map(find_key, na_action='ignore')

Is there a built-in pandas way to do this, or is this the best approach?

Upvotes: 1

Answers (2)

Eric Truett

Reputation: 3010

It will be easier if you invert the dictionary, so that each raw education value is a key and its category is the value, instead of mapping each category to a list.

education_cats = {
    'High-school student' : ['dropped out of high school', 'working on high school'],
    'Ungraduated' : ['graduated from high school', 'dropped out of college/university', 'dropped out of space camp', 
                     'dropped out of two-year college', 'high school', 'dropped out of law school','dropped out of med school'],
    'Student' : ['working on college/university', 'working on two-year college', 'working on law school', 'working on med school'],
    'Graduated' : ['graduated from college/university', 'graduated from two-year college', 'graduated from law school',  
                   'college/university', 'graduated from space camp', 'working on space camp', 'graduated from med school', 
                   'two-year college', 'dropped out of masters program', 'space camp', 'law school', 'med school'],
    '2nd-degree student' : ['working on masters program'],
    'Master' : ['graduated from masters program', 'masters program', 'dropped out of ph.d program'],
    '3rd-degree student' : ['working on ph.d program'],
    'P.hd' : ['graduated from ph.d program', 'ph.d program']
}

cats = {}
for cat, l in education_cats.items():
    for item in l:
        cats[item] = cat

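Now you can use either `apply` or `map` with a default value:

default_value = 'Unknown'

# Option 1: look up each value, falling back to the default
df['education_category'] = df['education'].apply(lambda x: cats.get(x, default_value))

# Option 2: map through the dict, then fill the unmatched values
df['education_category'] = df['education'].map(cats).fillna(default_value)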
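
For reference, here is a minimal self-contained sketch of the same idea on toy data. The two-category dictionary and the four-row frame are made up for illustration and are not the real OkCupid data; the inversion is written as a dict comprehension, which is equivalent to the nested loop above.

```python
import pandas as pd

# Toy subset of the grouping (illustrative only, not the full dictionary).
education_cats = {
    'Student': ['working on college/university', 'working on law school'],
    'Graduated': ['graduated from college/university', 'college/university'],
}

# Invert to {raw value: category}.
cats = {item: cat for cat, items in education_cats.items() for item in items}

# Hypothetical sample: one unmatched value and one missing value.
df = pd.DataFrame({
    'education': ['working on law school', 'college/university', 'space camp', None]
})

# Unmatched and missing values both become NaN after map, so both get the default.
df['education_category'] = df['education'].map(cats).fillna('Unknown')

# Variant that keeps missing input as missing (closer to na_action='ignore').
df['education_category_keep_na'] = (
    df['education'].map(cats).fillna('Unknown').where(df['education'].notna())
)
```

Note that `fillna` also labels rows whose education was missing to begin with; the second variant is one way to keep those as NaN if that matters for the plot.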

Upvotes: 1

yatu

Reputation: 88236

Say the Series of strings is in the studies column. You can split each string on the first space, and then collect the remainders in a defaultdict keyed by the first word:

# n must be passed by keyword in recent pandas versions
l = df.studies.str.split(' ', n=1, expand=True).values.tolist()

from collections import defaultdict
d = defaultdict(list)
for i in l:
    d[i[0]].append(i[1])

print(d)

defaultdict(list,
            {'graduated': ['from college/university',
              'from masters program',
              'from two-year college',
              'from high school',
              'from ph.d program',
              'from law school',
              'from space camp',
              'from med school'],
             'working': ['on college/university',
              'on masters program',
              'on two-year college',
              'on ph.d program',
              'on space camp',
              'on law school',
              'on med school',
              'on high school'],
             'dropped': ['out of college/university',
              'out of space camp',
              'out of two-year college',
              'out of masters program',
              'out of ph.d program',
              'out of high school',
              'out of law school',
              'out of med school'],
             'college/university': [None],
             'two-year': ['college'],
             'masters': ['program'],
             'high': ['school'],
             'space': ['camp'],
             'ph.d': ['program'],
             'law': ['school'],
             'med': ['school']})
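
To try this without the full dataset, here is a small runnable sketch of the same split-and-collect step. The five sample strings and the variable names `studies` and `parts` are assumptions for illustration; a value with no space has nothing after the split, so its remainder is a missing value.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical sample of the education strings.
studies = pd.Series([
    'graduated from high school',
    'working on law school',
    'dropped out of med school',
    'graduated from law school',
    'college/university',
])

# Split on the first space only: column 0 is the first word, column 1 the rest.
parts = studies.str.split(' ', n=1, expand=True)

# Collect the remainders under their first word.
d = defaultdict(list)
for first, rest in parts.values.tolist():
    d[first].append(rest)
```

The keys of `d` ('graduated', 'working', 'dropped', plus the bare school names) could then serve as a starting point for deciding how to group the categories.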

Upvotes: 1
