Leopold Wahlbeck
Leopold Wahlbeck

Reputation: 191

Merging on closest value Pandas

I am currently working on a project where I match as well as merge two excel documents by meter indication (km). The two dataframes contain the same type of data but are slightly different. My solution to this problem was to divide each dataframe by 100 to more easily match the columns. However, as I predicted there will be occasions where this doesn't work, making this a temporary solution.

I have two dataframes:

1.

Meter_indication      Fuel1
1180784               275
1181278               280
1181791               300
1182285               280
1182801               300
1183295               280
1183717               250
Meter_indication      Fuel2
1180785               278
1181282               282
1181800               310
1182401               282
1182824               320
1183310               215
1183727               250

As you can see these dataframes are containing the same type of information but are slightly off. I have previously used the method of merge merged_df = df1filt2.merge(df2filt, on='Meter_indication') which does only merge values that perfectly match. However, in this scenario this method is irrelevant.

My desired output is:

Meter_indication      Fuel1      Fuel2
1180784               275        278
1181278               280        282
1181791               300        310
1182285               280        282
1182801               300        320
1183295               280        215
1183717               250        250

As you can see, the dataframes have merged on "Meter_indication" and have merged by finding the closest value compared to itself.

I have looked around for others with similar problems and have tried many different suggestions such as https://pandas.pydata.org/pandas-docs/version/0.25.0/reference/api/pandas.merge_asof.html

Merging pandas dataframes based on nearest value(s)

But none have resulted in success.

My current code (necessary parts) is:

    filepathname1=input1.variable
    filepathname2=input2.variable
    filepathname3=output1.variable
    filepathname4=output2.variable

    #Creating filepaths for Automatic and Manual doc (1 = Automatic, 2 = Manual).
    print("You have chosen to mix", filepathname1, "and", filepathname2)

    #Changes the option of pd (max rows and columns).
    pd.set_option("display.max_rows", None, "display.max_columns", None)

    #READS PROVIDED DOCUMENTS.
    df1 = pd.read_excel(
        filepathname1, sheetname, na_values=["NA"], skiprows=1, usecols="A, B, C, D, E, F")
    df2 = pd.read_excel(
        filepathname2,
        na_values=["NA"],
        skiprows=2,
        usecols="D, AG, AH")

    # Drop NaN rows.
    df2.dropna(inplace=True)
    df1.dropna(inplace=True)

    print(df2)

    #df100 = pd.DataFrame()

    #df100['Bränslenivå (%)'] = df1['Bränslenivå (%)']

    #df100['Bränslenivå (%)'] = (df100['Bränslenivå (%)'] >= 99)


    #print(df100)
    #input()

    #Filters out rows with the keywords listed in 'blacklist'.
    df1.rename(columns={"Bränslenivå (%)": "Bränsle"}, inplace=True)
    df1 = df1[~df1.Bränsle.isin(blacklist)]
    df1.rename(columns={"Bränsle": "Bränslenivå (%)"}, inplace=True)

    df2.rename(columns={"Unnamed 32": "Actual refuel"}, inplace=True)
    df2.rename(columns={"Unnamed 33": "Mätarställning"}, inplace=True)

    #Creates new column for the difference in fuellevel column.
    df1["Difference (%)"] = df1["Bränslenivå (%)"]
    df1["Difference (%)"] = df1.loc[:, "Bränslenivå (%)"].diff()

    # Renames time-column so that they match.
    df2.rename(columns={"Datum": "Tid"}, inplace=True)



    # Drops rows where the difference is equal to 0.
    df1filt = df1[(df1["Difference (%)"] != 0)]

    # Converts time-column to only year, month and date.
    df1filt["Tid"] = pd.to_datetime(df1filt["Tid"]).dt.strftime("%Y%m%d").astype(str)

    df1filt.reset_index(level=0, inplace=True)

    #Renames the index column to "row" in order to later use the "row" column 
    df1filt.rename(columns={"index": "row"}, inplace=True)

    # Creates a new column for the difference in total driven kilometers (used for matching)
    df1filt["Match"] = df1filt["Vägmätare (km)"]
    df1filt["Match"] = df1filt.loc[:, "Vägmätare (km)"].diff()

    #Merges refuels that are previously seperated because of the timeintervals. For example when a refuel takes a lot of time and gets split into two different refuels.
    ROWRANGE = len(df1filt)+1
    thevalue = 0
    for currentrow in range(ROWRANGE-1):
        if df1filt.loc[currentrow, 'Difference (%)'] >= 0.0 and df1filt.loc[currentrow-1, 'Difference (%)'] <= 0:
            thevalue = 0
            thevalue += df1filt.loc[currentrow,'Difference (%)']
            df1filt.loc[currentrow,'Match'] = "SUMMED"
        if df1filt.loc[currentrow, 'Difference (%)'] >= 0.0 and df1filt.loc[currentrow-1, 'Difference (%)'] >= 0:
            thevalue += df1filt.loc[currentrow,'Difference (%)']
        if df1filt.loc[currentrow, 'Difference (%)'] <= 0.0 and df1filt.loc[currentrow-1, 'Difference (%)'] >= 0:
            df1filt.loc[currentrow-1,'Difference (%)'] = thevalue
            df1filt.loc[currentrow-1,'Match'] = "OFFICIAL"        
            thevalue = 0

    #Removes single "refuels" that are lower than 5
    df1filt = df1filt[(df1filt['Difference (%)'] > 5)]

    #Creates a new dataframe for the summed values 
    df1filt2 = df1filt[(df1filt['Match'] == "OFFICIAL")]

    #Creates a estimated refueled amount column for the automatic
    df1filt2["Apparent refuel"] = df1filt2["Difference (%)"]
    df1filt2["Apparent refuel"] = df1filt2.loc[:, "Difference (%)"]/100 *fuelcapacity

    #Renames total kilometer column so that the two documents can match
    df1filt2.rename(columns={"Vägmätare (km)": "Mätarställning"}, inplace=True)

    #Filters out rows where refuel and kilometer = NaN (Manual)

    df2.rename(columns={"x": "Actual refuel"}, inplace=True)
    df2.rename(columns={"x.1": "Mätarställning"}, inplace=True)
    df2.rename(columns={"Datum.1": "Tid"}, inplace=True)
    for col in df2.columns:
        print(col)
    input()
    df2filt = df2[(df2['Actual refuel'] != NaN) & (df2['Mätarställning'] != NaN)]


    #Drops first row
    df2filt.drop(df2filt.index[0], inplace=True)

    #Adds prefix for the time column so that they match (not used anymore because km is used to match)
    df2filt['Tid'] = '20' + df2filt['Tid'].astype(str) 

    #Rounds numeric columns
    decimals = 0
    
    df2filt['Mätarställning'] = pd.to_numeric(df2filt['Mätarställning'],errors='coerce')
    df2filt['Actual refuel'] = pd.to_numeric(df2filt['Actual refuel'],errors='coerce')
    df2filt['Mätarställning'] = df2filt['Mätarställning'].apply(lambda x: round(x, decimals))
    df2filt['Actual refuel'] = df2filt['Actual refuel'].apply(lambda x: round(x, decimals))
    

    #This is my temporary fix to the problem
    df2filt['Mätarställning'] //= 100
    df1filt2['Mätarställning'] //= 100

    merged_df = df1filt2.merge(df2filt, on='Mätarställning')

    merged_df.to_excel(filepathname3, index = False)

Hopefully this was enough information. Thank you in advance!

~UPDATE (RESULT/ QUESTIONING)~

Thank you @Rob Raymond for solving my problem. However with the solution provided the problem has been solved but at the same time new problems have occurred. The solution below solves the "merge"-problem partially.

The output now on some places seem to duplicate and rows seem to merge on the same Meter_indication. Im not sure what is causing this but it may have something to do with how "merge_asof" works. I have much data meaning that some data will have different amount of digits. Im not very familiar with the "merge_asof" method but one guess is that it merges on numbers that are similar but arent exactly the same. With other words, if this is the case, "1234" will merge with "12345" because both of them contain "1234".

Here are some graphs I have created using the data to clarify as well as illustrate the problem.

Current graph output:

GRAPH 1 (CURRENT)

Desired graph output:

GRAPH 2 (DESIRED)

With the data (used for graph) sometimes looks like this:

DATA USED FOR CURRENT GRAPH (where yellow numbers are "fuel1" and the numbers to the right of it are "Meter_indication")

Upvotes: 2

Views: 2351

Answers (1)

Rob Raymond
Rob Raymond

Reputation: 31226

  • merge_asof() does work. matches your expected output
  • have not made assumptions about sorting, so did sort_values()
  • for transparency include right Meter_indication in output
df1 = pd.read_csv(io.StringIO("""Meter_indication      Fuel1
1180784               275
1181278               280
1181791               300
1182285               280
1182801               300
1183295               280
1183717               250"""), sep="\s+")

df2 = pd.read_csv(io.StringIO("""Meter_indication      Fuel2
1180785               278
1181282               282
1181800               310
1182401               282
1182824               320
1183310               215
1183727               250"""), sep="\s+")

pd.merge_asof(
    df1.sort_values("Meter_indication"),
    df2.sort_values("Meter_indication").assign(mi=lambda d: d["Meter_indication"]),
    on="Meter_indication",
    direction="nearest",
)

Meter_indication Fuel1 Fuel2 mi
1180784 275 278 1180785
1181278 280 282 1181282
1181791 300 310 1181800
1182285 280 282 1182401
1182801 300 320 1182824
1183295 280 215 1183310
1183717 250 250 1183727

Upvotes: 4

Related Questions