Reputation: 191
I am currently working on a project where I match as well as merge two excel documents by meter indication (km). The two dataframes contain the same type of data but are slightly different. My solution to this problem was to divide each dataframe by 100 to more easily match the columns. However, as I predicted there will be occasions where this doesn't work, making this a temporary solution.
I have two dataframes:
1.
Meter_indication Fuel1
1180784 275
1181278 280
1181791 300
1182285 280
1182801 300
1183295 280
1183717 250
Meter_indication Fuel2
1180785 278
1181282 282
1181800 310
1182401 282
1182824 320
1183310 215
1183727 250
As you can see these dataframes are containing the same type of information but are slightly off. I have previously used the method of merge merged_df = df1filt2.merge(df2filt, on='Meter_indication')
which does only merge values that perfectly match. However, in this scenario this method is irrelevant.
My desired output is:
Meter_indication Fuel1 Fuel2
1180784 275 278
1181278 280 282
1181791 300 310
1182285 280 282
1182801 300 320
1183295 280 215
1183717 250 250
As you can see, the dataframes have merged on "Meter_indication" and have merged by finding the closest value compared to itself.
I have looked around for others with similar problems and have tried many different suggestions such as https://pandas.pydata.org/pandas-docs/version/0.25.0/reference/api/pandas.merge_asof.html
Merging pandas dataframes based on nearest value(s)
But none have resulted in success.
My current code (necessary parts) is:
filepathname1=input1.variable
filepathname2=input2.variable
filepathname3=output1.variable
filepathname4=output2.variable
#Creating filepaths for Automatic and Manual doc (1 = Automatic, 2 = Manual).
print("You have chosen to mix", filepathname1, "and", filepathname2)
#Changes the option of pd (max rows and columns).
pd.set_option("display.max_rows", None, "display.max_columns", None)
#READS PROVIDED DOCUMENTS.
df1 = pd.read_excel(
filepathname1, sheetname, na_values=["NA"], skiprows=1, usecols="A, B, C, D, E, F")
df2 = pd.read_excel(
filepathname2,
na_values=["NA"],
skiprows=2,
usecols="D, AG, AH")
# Drop NaN rows.
df2.dropna(inplace=True)
df1.dropna(inplace=True)
print(df2)
#df100 = pd.DataFrame()
#df100['Bränslenivå (%)'] = df1['Bränslenivå (%)']
#df100['Bränslenivå (%)'] = (df100['Bränslenivå (%)'] >= 99)
#print(df100)
#input()
#Filters out rows with the keywords listed in 'blacklist'.
df1.rename(columns={"Bränslenivå (%)": "Bränsle"}, inplace=True)
df1 = df1[~df1.Bränsle.isin(blacklist)]
df1.rename(columns={"Bränsle": "Bränslenivå (%)"}, inplace=True)
df2.rename(columns={"Unnamed 32": "Actual refuel"}, inplace=True)
df2.rename(columns={"Unnamed 33": "Mätarställning"}, inplace=True)
#Creates new column for the difference in fuellevel column.
df1["Difference (%)"] = df1["Bränslenivå (%)"]
df1["Difference (%)"] = df1.loc[:, "Bränslenivå (%)"].diff()
# Renames time-column so that they match.
df2.rename(columns={"Datum": "Tid"}, inplace=True)
# Drops rows where the difference is equal to 0.
df1filt = df1[(df1["Difference (%)"] != 0)]
# Converts time-column to only year, month and date.
df1filt["Tid"] = pd.to_datetime(df1filt["Tid"]).dt.strftime("%Y%m%d").astype(str)
df1filt.reset_index(level=0, inplace=True)
#Renames the index column to "row" in order to later use the "row" column
df1filt.rename(columns={"index": "row"}, inplace=True)
# Creates a new column for the difference in total driven kilometers (used for matching)
df1filt["Match"] = df1filt["Vägmätare (km)"]
df1filt["Match"] = df1filt.loc[:, "Vägmätare (km)"].diff()
#Merges refuels that are previously seperated because of the timeintervals. For example when a refuel takes a lot of time and gets split into two different refuels.
ROWRANGE = len(df1filt)+1
thevalue = 0
for currentrow in range(ROWRANGE-1):
if df1filt.loc[currentrow, 'Difference (%)'] >= 0.0 and df1filt.loc[currentrow-1, 'Difference (%)'] <= 0:
thevalue = 0
thevalue += df1filt.loc[currentrow,'Difference (%)']
df1filt.loc[currentrow,'Match'] = "SUMMED"
if df1filt.loc[currentrow, 'Difference (%)'] >= 0.0 and df1filt.loc[currentrow-1, 'Difference (%)'] >= 0:
thevalue += df1filt.loc[currentrow,'Difference (%)']
if df1filt.loc[currentrow, 'Difference (%)'] <= 0.0 and df1filt.loc[currentrow-1, 'Difference (%)'] >= 0:
df1filt.loc[currentrow-1,'Difference (%)'] = thevalue
df1filt.loc[currentrow-1,'Match'] = "OFFICIAL"
thevalue = 0
#Removes single "refuels" that are lower than 5
df1filt = df1filt[(df1filt['Difference (%)'] > 5)]
#Creates a new dataframe for the summed values
df1filt2 = df1filt[(df1filt['Match'] == "OFFICIAL")]
#Creates a estimated refueled amount column for the automatic
df1filt2["Apparent refuel"] = df1filt2["Difference (%)"]
df1filt2["Apparent refuel"] = df1filt2.loc[:, "Difference (%)"]/100 *fuelcapacity
#Renames total kilometer column so that the two documents can match
df1filt2.rename(columns={"Vägmätare (km)": "Mätarställning"}, inplace=True)
#Filters out rows where refuel and kilometer = NaN (Manual)
df2.rename(columns={"x": "Actual refuel"}, inplace=True)
df2.rename(columns={"x.1": "Mätarställning"}, inplace=True)
df2.rename(columns={"Datum.1": "Tid"}, inplace=True)
for col in df2.columns:
print(col)
input()
df2filt = df2[(df2['Actual refuel'] != NaN) & (df2['Mätarställning'] != NaN)]
#Drops first row
df2filt.drop(df2filt.index[0], inplace=True)
#Adds prefix for the time column so that they match (not used anymore because km is used to match)
df2filt['Tid'] = '20' + df2filt['Tid'].astype(str)
#Rounds numeric columns
decimals = 0
df2filt['Mätarställning'] = pd.to_numeric(df2filt['Mätarställning'],errors='coerce')
df2filt['Actual refuel'] = pd.to_numeric(df2filt['Actual refuel'],errors='coerce')
df2filt['Mätarställning'] = df2filt['Mätarställning'].apply(lambda x: round(x, decimals))
df2filt['Actual refuel'] = df2filt['Actual refuel'].apply(lambda x: round(x, decimals))
#This is my temporary fix to the problem
df2filt['Mätarställning'] //= 100
df1filt2['Mätarställning'] //= 100
merged_df = df1filt2.merge(df2filt, on='Mätarställning')
merged_df.to_excel(filepathname3, index = False)
Hopefully this was enough information. Thank you in advance!
~UPDATE (RESULT/ QUESTIONING)~
Thank you @Rob Raymond for solving my problem. However with the solution provided the problem has been solved but at the same time new problems have occurred. The solution below solves the "merge"-problem partially.
The output now on some places seem to duplicate and rows seem to merge on the same Meter_indication. Im not sure what is causing this but it may have something to do with how "merge_asof" works. I have much data meaning that some data will have different amount of digits. Im not very familiar with the "merge_asof" method but one guess is that it merges on numbers that are similar but arent exactly the same. With other words, if this is the case, "1234" will merge with "12345" because both of them contain "1234".
Here are some graphs I have created using the data to clarify as well as illustrate the problem.
Current graph output:
Desired graph output:
With the data (used for graph) sometimes looks like this:
Upvotes: 2
Views: 2351
Reputation: 31226
merge_asof()
does work. matches your expected outputsort_values()
df1 = pd.read_csv(io.StringIO("""Meter_indication Fuel1
1180784 275
1181278 280
1181791 300
1182285 280
1182801 300
1183295 280
1183717 250"""), sep="\s+")
df2 = pd.read_csv(io.StringIO("""Meter_indication Fuel2
1180785 278
1181282 282
1181800 310
1182401 282
1182824 320
1183310 215
1183727 250"""), sep="\s+")
pd.merge_asof(
df1.sort_values("Meter_indication"),
df2.sort_values("Meter_indication").assign(mi=lambda d: d["Meter_indication"]),
on="Meter_indication",
direction="nearest",
)
Meter_indication | Fuel1 | Fuel2 | mi |
---|---|---|---|
1180784 | 275 | 278 | 1180785 |
1181278 | 280 | 282 | 1181282 |
1181791 | 300 | 310 | 1181800 |
1182285 | 280 | 282 | 1182401 |
1182801 | 300 | 320 | 1182824 |
1183295 | 280 | 215 | 1183310 |
1183717 | 250 | 250 | 1183727 |
Upvotes: 4