Reputation: 1677
Scenario: I have a code that reads a set of excel files from a directory and gathers the contents of each to a dataframe in a list then concatenates it. The code also reads another file where it gets the data for some identifiers into another dataframe.
Example of data in the concatenated dataframe from the list:
Iteration Run Value
9154aa 3 100
9154aa 7 112
9154aa 1 120
3148nf 77 58
3148nf 7 86
9421jh 23 27
9421jh 42 736
9421jh 4 44
9421jh 9 82
Example of the other dataframe:
Iteration Date
9154aa 01012011
1582he 01052013
3148nf 01092011
9421jh 01012010
The first DF has information for multiple iterations concatenated, while the additional DF has a piece of information for all iterations.
Objective: My objective is to put the date related to a iteration into the first dataframe (in every row that corresponds to that iterations).
Output Example:
Iteration Run Value Date
9154aa 3 100 01012011
9154aa 7 112 01012011
9154aa 1 120 01012011
3148nf 77 58 01092011
3148nf 7 86 01092011
9421jh 23 27 01012010
9421jh 42 736 01012010
9421jh 4 44 01012010
9421jh 9 82 01012010
Issue: Although the script runs with no crashes, for some reason my output is duplicating one (or more) of my iteration entries.
Example of flawed output:
Iteration Run Value Date
9154aa 3 100 01012011
9154aa 7 112 01012011
9154aa 1 120 01012011
3148nf 77 58 01092011
3148nf 77 58 01092011
3148nf 7 86 01092011
3148nf 7 86 01092011
9421jh 23 27 01012010
9421jh 42 736 01012010
9421jh 4 44 01012010
9421jh 9 82 01012010
I have no idea of the reason for this behavior.
Question: What am I doing wrong?
Code:
sourcefolder = "\\Network\DGMS\2018"
outputfolder = "\\Network\DGMS\2018"
adjustmentinputs = "//Network/DGMS/Uploader_v1.xlsm"
selectmonth = input("Please enter month ('January', 'February'...):")
# Get Adjustments
ApplyOnDates = pd.read_excel(open(adjustmentinputs, 'rb'), sheet_name='Calendar')
# Get content
all_files = glob.glob(os.path.join(sourcefolder, "*.xls*"))
contentdataframes = []
contentdataframes2 = []
for f in all_files:
df = pd.read_excel(f)
df['Iteration'] = os.path.basename(f).split('.')[0].split('_')[0]
mask = df.columns.str.contains('Base|Last|Fix')
c2 = df.columns[~mask].tolist()
df = df[c2]
contentdataframes.append(df)
print (f)
concatenatedfinal = pd.concat(contentdataframes)
# Date Adjustment
ApplyOnDates = ApplyOnDates[["IT", selectmonth]]
ApplyOnDates = ApplyOnDates.rename(index=str, columns={"IT": "Iteration", selectmonth: "Date"})
Datawithfixeddates = pd.DataFrame.merge(concatenatedfinal, ApplyOnDates, left_on='Iteration', right_on='Iteration', indicator=False)
OBS: In the example I used only a small amount of data, while normally it would do it for dozens of iterations.
Upvotes: 1
Views: 57
Reputation: 1559
You need to use a left join here. As per the documentation, left join preserves all of the values in the first DataFrame, subbing in values from the second dependent on the structure of the first.
Set the parameter pd.DataFrame.merge(how='left')
For your data as below:
In[13]: print(df1)
Out[13]:
Iteration Run Value
0 9154aa 3 100
1 9154aa 7 112
2 9154aa 1 120
3 3148nf 77 58
4 3148nf 7 86
5 9421jh 23 27
6 9421jh 42 736
7 9421jh 4 44
8 9421jh 9 82
and
In[15]: print(df2)
Out[15]:
Iteration Date
0 9154aa 01012011
1 1582he 01052013
2 3148nf 01092011
3 9421jh 01012010
The following is true
In[16]: print(df1.merge(df2,left_on='Iteration',right_on='Iteration',how='left'))
Out[16]:
Iteration Run Value Date
0 9154aa 3 100 01012011
1 9154aa 7 112 01012011
2 9154aa 1 120 01012011
3 3148nf 77 58 01092011
4 3148nf 7 86 01092011
5 9421jh 23 27 01012010
6 9421jh 42 736 01012010
7 9421jh 4 44 01012010
8 9421jh 9 82 01012010
Upvotes: 2