Anup Dutta
Anup Dutta

Reputation: 65

Python Panda-Specific format of text with hyphen

I need to change multiple such python dataframe columns that do not follow a specific format like Name-ID-Date. And want to change that to follow the same format. I have attached the input and Corrected output format as images.

I have written some code that basically looks at all the columns in dataframe and if it follows the format then it separates the data into 3 different columns but if does not follow the specific format of Name-ID-Date the code is not able to proceed. Any help will be highly appreciated here.

Input & Corrected output file

    dff = df[['PPS_REQ','Candidate1', 'Candidate2',
       'Candidate3', 'Candidate4', 'Candidate5', 'Candidate6', 'Candidate7',
       'Candidate8', 'Candidate9','Candidate10', 'Candidate11', 'Candidate12',
        'Candidate13', 'Candidate14', 'Candidate15', 'Candidate16',
        'Candidate17', 'Candidate18', 'Candidate19', 'Candidate20','Candidate21',
        'Candidate22','Candidate23','Candidate24','Candidate25','Candidate26','Candidate27','Candidate28']]
all_candiadates = ['Candidate1', 'Candidate2',
           'Candidate3', 'Candidate4', 'Candidate5', 'Candidate6', 'Candidate7',
           'Candidate8', 'Candidate9','Candidate10', 'Candidate11', 'Candidate12',
            'Candidate13', 'Candidate14', 'Candidate15', 'Candidate16',
            'Candidate17', 'Candidate18', 'Candidate19', 'Candidate20','Candidate21',
            'Candidate22','Candidate23','Candidate24','Candidate25','Candidate26','Candidate27','Candidate28']#,'Candidate29','Candidate30','Candidate31','Candidate32','Candidate33','Candidate34','Candidate35','Candidate36','Candidate37','Candidate38']
         
    
    blank = pd.DataFrame()
    
    for index, row in dff.iterrows():
        for c in all_candiadates:
            print('the value of c :',c)
            candidate = dff[['PPS_REQ',c]]
            candidate[['Name','Id','Sdate']] = candidate[c].str.split('-',n=-1,expand=True)
            blank = blank.append(candidate)

Thank you

Upvotes: 1

Views: 62

Answers (2)

Anup Dutta
Anup Dutta

Reputation: 65

Finally this is fixed, just adding this if this is useful for someone else.

blank = pd.DataFrame()

#for index, row in dff.iterrows():
for c in all_candiadates:
#        print('the value of c :',c)
        try:
            candidate = dff[['PPS_REQ',c]]
            candidate = candidate[candidate[c].str.contains('FILL|Reopen|Fill|REOPEN|Duplicate|reopen|FILED|fill') != True]
            candidate=candidate.loc[(candidate[c] !="")]
            candidate['Sdate'] = candidate[c].str.extract('(\d+/\d+)', expand=True)
            candidate['Id'] =  candidate[c].str.extract('(\d\d\d\d\d\d\d)', expand=True)
            candidate['Name'] =  candidate[c].str.extract('([a-zA-Z ]*)\d*.*', expand=False)
#        candidate[['Name','Id','Sdate']] = candidate[c].str.split('-',n=-1,expand=True)
            blank = blank.append(candidate)
        except:
            pass

blank = blank[['PPS_REQ', 'Name','Id','Sdate']]
bb = blank.drop_duplicates()

Upvotes: 0

Anup Dutta
Anup Dutta

Reputation: 65

i have done some workaround in the code something like below, But the problem I am facing with this part of code:

candidate['Sdate'] = candidate[c].str.extract('(../..)', expand=True)

Here if Date is 11/18 it works fine, but if date is 11/8 it returns nan.

for index, row in dff.iterrows():
    for c in all_candiadates:
        print('the value of c :',c)
        candidate = dff[['PPS_REQ',c]]
        candidate['Sdate'] = candidate[c].str.extract('(../..)', expand=True)
        candidate['Id'] =  candidate[c].str.extract('(\d\d\d\d\d\d\d)', expand=True)
        candidate['Name'] =  candidate[c].str.extract('([a-zA-Z ]*)\d*.*', expand=False)
#        candidate[['Name','Id','Sdate']] = candidate[c].str.split('-',n=-1,expand=True)
        blank = blank.append(candidate)

Upvotes: 2

Related Questions