Reputation: 65
I need to change multiple such python dataframe columns that do not follow a specific format like Name-ID-Date. And want to change that to follow the same format. I have attached the input and Corrected output format as images.
I have written some code that basically looks at all the columns in dataframe and if it follows the format then it separates the data into 3 different columns but if does not follow the specific format of Name-ID-Date the code is not able to proceed. Any help will be highly appreciated here.
dff = df[['PPS_REQ','Candidate1', 'Candidate2',
'Candidate3', 'Candidate4', 'Candidate5', 'Candidate6', 'Candidate7',
'Candidate8', 'Candidate9','Candidate10', 'Candidate11', 'Candidate12',
'Candidate13', 'Candidate14', 'Candidate15', 'Candidate16',
'Candidate17', 'Candidate18', 'Candidate19', 'Candidate20','Candidate21',
'Candidate22','Candidate23','Candidate24','Candidate25','Candidate26','Candidate27','Candidate28']]
all_candiadates = ['Candidate1', 'Candidate2',
'Candidate3', 'Candidate4', 'Candidate5', 'Candidate6', 'Candidate7',
'Candidate8', 'Candidate9','Candidate10', 'Candidate11', 'Candidate12',
'Candidate13', 'Candidate14', 'Candidate15', 'Candidate16',
'Candidate17', 'Candidate18', 'Candidate19', 'Candidate20','Candidate21',
'Candidate22','Candidate23','Candidate24','Candidate25','Candidate26','Candidate27','Candidate28']#,'Candidate29','Candidate30','Candidate31','Candidate32','Candidate33','Candidate34','Candidate35','Candidate36','Candidate37','Candidate38']
blank = pd.DataFrame()
for index, row in dff.iterrows():
for c in all_candiadates:
print('the value of c :',c)
candidate = dff[['PPS_REQ',c]]
candidate[['Name','Id','Sdate']] = candidate[c].str.split('-',n=-1,expand=True)
blank = blank.append(candidate)
Thank you
Upvotes: 1
Views: 62
Reputation: 65
Finally this is fixed, just adding this if this is useful for someone else.
blank = pd.DataFrame()
#for index, row in dff.iterrows():
for c in all_candiadates:
# print('the value of c :',c)
try:
candidate = dff[['PPS_REQ',c]]
candidate = candidate[candidate[c].str.contains('FILL|Reopen|Fill|REOPEN|Duplicate|reopen|FILED|fill') != True]
candidate=candidate.loc[(candidate[c] !="")]
candidate['Sdate'] = candidate[c].str.extract('(\d+/\d+)', expand=True)
candidate['Id'] = candidate[c].str.extract('(\d\d\d\d\d\d\d)', expand=True)
candidate['Name'] = candidate[c].str.extract('([a-zA-Z ]*)\d*.*', expand=False)
# candidate[['Name','Id','Sdate']] = candidate[c].str.split('-',n=-1,expand=True)
blank = blank.append(candidate)
except:
pass
blank = blank[['PPS_REQ', 'Name','Id','Sdate']]
bb = blank.drop_duplicates()
Upvotes: 0
Reputation: 65
i have done some workaround in the code something like below, But the problem I am facing with this part of code:
candidate['Sdate'] = candidate[c].str.extract('(../..)', expand=True)
Here if Date is 11/18 it works fine, but if date is 11/8 it returns nan.
for index, row in dff.iterrows():
for c in all_candiadates:
print('the value of c :',c)
candidate = dff[['PPS_REQ',c]]
candidate['Sdate'] = candidate[c].str.extract('(../..)', expand=True)
candidate['Id'] = candidate[c].str.extract('(\d\d\d\d\d\d\d)', expand=True)
candidate['Name'] = candidate[c].str.extract('([a-zA-Z ]*)\d*.*', expand=False)
# candidate[['Name','Id','Sdate']] = candidate[c].str.split('-',n=-1,expand=True)
blank = blank.append(candidate)
Upvotes: 2