Merging Pandas dataframes on same column identifier with improper output

Question

Scenario: I have a code that reads a set of excel files from a directory and gathers the contents of each to a dataframe in a list then concatenates it. The code also reads another file where it gets the data for some identifiers into another dataframe.

Example of data in the concatenated dataframe from the list:

Iteration        Run        Value
 9154aa           3          100
 9154aa           7          112
 9154aa           1          120
 3148nf           77         58
 3148nf           7          86
 9421jh           23         27
 9421jh           42         736
 9421jh           4          44
 9421jh           9          82

Example of the other dataframe:

Iteration      Date
 9154aa      01012011
 1582he      01052013
 3148nf      01092011
 9421jh      01012010

The first DF has information for multiple iterations concatenated, while the additional DF has a piece of information for all iterations.

Objective: My objective is to put the date related to a iteration into the first dataframe (in every row that corresponds to that iterations).

Output Example:

Iteration        Run        Value       Date
 9154aa           3          100      01012011
 9154aa           7          112      01012011
 9154aa           1          120      01012011
 3148nf           77         58       01092011
 3148nf           7          86       01092011
 9421jh           23         27       01012010
 9421jh           42         736      01012010
 9421jh           4          44       01012010
 9421jh           9          82       01012010

Issue: Although the script runs with no crashes, for some reason my output is duplicating one (or more) of my iteration entries.

Example of flawed output:

Iteration        Run        Value       Date
 9154aa           3          100      01012011
 9154aa           7          112      01012011
 9154aa           1          120      01012011
 3148nf           77         58       01092011
 3148nf           77         58       01092011
 3148nf           7          86       01092011
 3148nf           7          86       01092011
 9421jh           23         27       01012010
 9421jh           42         736      01012010
 9421jh           4          44       01012010
 9421jh           9          82       01012010

I have no idea of the reason for this behavior.

Question: What am I doing wrong?

Code:

sourcefolder = "\Network\DGMS\2018"
outputfolder = "\Network\DGMS\2018"
adjustmentinputs = "//Network/DGMS/Uploader_v1.xlsm"
selectmonth = input("Please enter month ('January', 'February'...):")

# Get Adjustments
ApplyOnDates = pd.read_excel(open(adjustmentinputs, 'rb'), sheet_name='Calendar')

# Get content
all_files = glob.glob(os.path.join(sourcefolder, "*.xls*"))
contentdataframes = []
contentdataframes2 = []

for f in all_files:
    df = pd.read_excel(f)
    df['Iteration'] = os.path.basename(f).split('.')[0].split('_')[0]
    mask = df.columns.str.contains('Base|Last|Fix')
    c2 = df.columns[~mask].tolist()
    df = df[c2]
    contentdataframes.append(df)
    print (f)

concatenatedfinal = pd.concat(contentdataframes)

# Date Adjustment 
ApplyOnDates = ApplyOnDates[["IT", selectmonth]]
ApplyOnDates = ApplyOnDates.rename(index=str, columns={"IT": "Iteration", selectmonth: "Date"})

Datawithfixeddates = pd.DataFrame.merge(concatenatedfinal, ApplyOnDates, left_on='Iteration', right_on='Iteration', indicator=False)

OBS: In the example I used only a small amount of data, while normally it would do it for dozens of iterations.

Rushabh Mehta · Accepted Answer

You need to use a left join here. As per the documentation, left join preserves all of the values in the first DataFrame, subbing in values from the second dependent on the structure of the first.

Set the parameter pd.DataFrame.merge(how='left')

For your data as below:

In[13]: print(df1)
Out[13]: 
Iteration  Run  Value
0    9154aa    3    100
1    9154aa    7    112
2    9154aa    1    120
3    3148nf   77     58
4    3148nf    7     86
5    9421jh   23     27
6    9421jh   42    736
7    9421jh    4     44
8    9421jh    9     82

and

In[15]: print(df2)
Out[15]: 
Iteration      Date
0    9154aa  01012011
1    1582he  01052013
2    3148nf  01092011
3    9421jh  01012010

The following is true

In[16]: print(df1.merge(df2,left_on='Iteration',right_on='Iteration',how='left'))
Out[16]: 
Iteration  Run  Value      Date
0    9154aa    3    100  01012011
1    9154aa    7    112  01012011
2    9154aa    1    120  01012011
3    3148nf   77     58  01092011
4    3148nf    7     86  01092011
5    9421jh   23     27  01012010
6    9421jh   42    736  01012010
7    9421jh    4     44  01012010
8    9421jh    9     82  01012010

Merging Pandas dataframes on same column identifier with improper output

Answers (1)

Related Questions