Reputation: 392
This question is related to Adding rows per group in pandas / ipython if per group a row is missing, but is a bit more complicated.
I have a table like this:
ID DEGREE TERM STATUS GRADTERM
1 Bachelors 20111 1
1 Bachelors 20116 1
2 Bachelors 20126 1
2 Bachelors 20131 1
2 Bachelors 20141 1
3 Bachelors 20106 1
3 Bachelors 20111 1 20116
3 Masters 20116 1
3 Masters 20121 1
3 Masters 20131 1 20136
What I would like is to turn that into this (when run for term 20151):
ID DEGREE TERM STATUS
1 Bachelors 20111 1
1 Bachelors 20116 1
1 Bachelors 20121 0
1 Bachelors 20126 0
1 Bachelors 20131 0
1 Bachelors 20136 0
1 Bachelors 20141 0
1 Bachelors 20146 0
1 Bachelors 20151 0
2 Bachelors 20126 1
2 Bachelors 20131 1
2 Bachelors 20136 0
2 Bachelors 20141 1
2 Bachelors 20146 0
2 Bachelors 20151 0
3 Bachelors 20106 1
3 Bachelors 20111 1
3 Bachelors 20116 2
3 Bachelors 20121 2
3 Bachelors 20126 2
3 Bachelors 20131 2
3 Bachelors 20136 2
3 Bachelors 20141 2
3 Bachelors 20146 2
3 Bachelors 20151 2
3 Masters 20116 1
3 Masters 20121 1
3 Masters 20126 0
3 Masters 20131 1
3 Masters 20136 2
3 Masters 20141 2
3 Masters 20146 2
3 Masters 20151 2
In each table, STATUS is 0 - Not Enrolled, 1 - Enrolled, and 2 - Graduated. The TERM fields are the year followed by a 1 or 6 for spring or fall.
Missing TERM records should be added for each person between their first record and the current term (which is 20151 in this case). For each added record, assign a STATUS of 0 unless the last existing record has a STATUS of 2 (which carries). That is, a person is enrolled (STATUS=1) or they are not (STATUS=0 or 2).
I am using pandas in Python, but I am new to Python. I have been trying to figure out how the indexing for a DataFrame works, but that is a complete mystery at this point. Any guidance would be greatly appreciated.
Upvotes: 1
Views: 86
Reputation: 24752
You can do it this way.
import pandas as pd
# python 3.4 used
import io
# just try to replicate your data. Use your own csv file instead
# =========================================================
csv = 'ID,DEGREE,TERM,STATUS,GRADTERM\n1,Bachelors,20111,1,\n1,Bachelors,20116,1,\n2,Bachelors,20126,1,\n2,Bachelors,20131,1,\n2,Bachelors,20141,1,\n3,Bachelors,20106,1,\n3,Bachelors,20111,1,20116.0\n3,Masters,20116,1,\n3,Masters,20121,1,\n3,Masters,20131,1,20136.0\n'
df = pd.read_csv(io.StringIO(csv)).set_index('ID')
print(df)
DEGREE TERM STATUS GRADTERM
ID
1 Bachelors 20111 1 NaN
1 Bachelors 20116 1 NaN
2 Bachelors 20126 1 NaN
2 Bachelors 20131 1 NaN
2 Bachelors 20141 1 NaN
3 Bachelors 20106 1 NaN
3 Bachelors 20111 1 20116
3 Masters 20116 1 NaN
3 Masters 20121 1 NaN
3 Masters 20131 1 20136
# two helper functions
# =========================================================
def build_year_term_range(start_term, current_term):
# assumes start_term current_term in format '20151' alike
start_year = int(start_term[:4]) # first four are year
start_term = int(start_term[-1]) # last four is term
current_year = int(current_term[:4])
current_term = int(current_term[-1])
# build a range
year_rng = np.repeat(np.arange(start_year, current_year+1), 2)
term_rng = [1, 6] * int(len(year_rng) / 2)
year_term_rng = [int(str(year) + str(term)) for year, term in zip(year_rng, term_rng)]
# check whether need to trim the first and last
if start_term == 6: # remove the first
year_term_rng = year_term_rng[1:]
if current_term == 1: # remove the last
year_term_rng = year_term_rng[:-1]
return year_term_rng
def my_apply_func(group, current_year_term=current_year_term):
# start of the record
start_year_term = str(group['TERM'].iloc[0]) # gives 2001
year_term_rng = build_year_term_range(start_year_term, current_year_term)
# manipulate the group
group = group.reset_index().set_index('TERM')
# use reindex to populate missing rows
group = group.reindex(year_term_rng)
# fillna ID/DEGREE same as previous
group[['ID', 'DEGREE']] = group[['ID', 'DEGREE']].fillna(method='ffill')
# fillna by 0 not enrolled (for now)
group['STATUS'] = group['STATUS'].fillna(0)
# shift GRADTERM 1 slot forward, because GRADTERM and TERM are not aligned
group['GRADTERM'] = group['GRADTERM'].shift(1)
# check whether has been graduate, convert to int, use cumsum to carry that non-zero entry forward, convert back to boolean
# might seems non-trivial at first place :)
group.loc[group['GRADTERM'].notnull().astype(int).cumsum().astype(bool), 'STATUS'] = 2
# return only relevant columns
return group['STATUS']
# start processing
# ============================================================
# move ID from index to a normal column
df = df.reset_index()
# please specify the current year term in string
current_year_term = '20151'
# assume ID is your index column
result = df.groupby(['ID', 'DEGREE']).apply(my_apply_func).reset_index()
Out[163]:
ID DEGREE TERM STATUS
0 1 Bachelors 20111 1
1 1 Bachelors 20116 1
2 1 Bachelors 20121 0
3 1 Bachelors 20126 0
4 1 Bachelors 20131 0
5 1 Bachelors 20136 0
6 1 Bachelors 20141 0
7 1 Bachelors 20146 0
8 1 Bachelors 20151 0
9 2 Bachelors 20126 1
10 2 Bachelors 20131 1
11 2 Bachelors 20136 0
12 2 Bachelors 20141 1
13 2 Bachelors 20146 0
14 2 Bachelors 20151 0
15 3 Bachelors 20106 1
16 3 Bachelors 20111 1
17 3 Bachelors 20116 2
18 3 Bachelors 20121 2
19 3 Bachelors 20126 2
20 3 Bachelors 20131 2
21 3 Bachelors 20136 2
22 3 Bachelors 20141 2
23 3 Bachelors 20146 2
24 3 Bachelors 20151 2
25 3 Masters 20116 1
26 3 Masters 20121 1
27 3 Masters 20126 0
28 3 Masters 20131 1
29 3 Masters 20136 2
30 3 Masters 20141 2
31 3 Masters 20146 2
32 3 Masters 20151 2
Upvotes: 1