How to efficiently match strings across two dataframes

Question

This is a question about theory as much as it's one about syntax. I'm writing a program to classify transactions in my bank statements, and I'm stuck on the best way on doing it.

I have a series of 'annual statement' dataframes that each contain a year's worth of transactions. These include the columns:

| description      | type           |
| sainsbury's      |                |
| js sainsbury     |                |
| amzn marketplace |                |
| mbna             |                |
| amazon           |                |
| netflix          |                |
| gap              |                |
| vue              |                |

Where description is the name of a vendor that took part in a transaction and type is an empty column that I want to fill.

I also have a 'classifications' dataframe, which looks like:

| type           | search term      |
| groceries      | sainsbury        |
| amazon         | amzn marketplace |
| amazon         | amazon           |
| subscriptions  | netflix          |
| clothes        | gap              |
| luxuries       | vue              |

Where type is a set of things I've decided provide a nice summary of what I spend my money on, and search term is a term that can be used to identify which transactions relate to which classifications.

Basically, I need to match the search terms in the classifications dataframe to the description column in the annual statement dataframe, then write the corresponding entry from the type column in the classifications dataframe to the type column in the annual statements dataframe. In other words, I'd end up with this (note that where there was no type for 'mbna' in the classifications dataframe, the type column is left blank):

| description      | type           |
| sainsbury's      | groceries      |
| js sainsbury     | groceries      |
| amzn marketplace | amazon         |
| mbna             |                |
| amazon           | amazon         |
| netflix          | subscriptions  |
| gap              | clothes        |
| vue              | luxuries       |

As for scale, there's thousands of rows in each annual statement dataframe, around ten different types, and probably less than a hundred search terms.

I've had a few thoughts about how to do this, but I'm a beginner with Pandas so I'd appreciate some expert help!

Daniel Geffen · Accepted Answer

This seems like a pretty simple merge. You can specify different columns to match in the different dataframes. Something like this should work:

result_df = annual_statemenets_df.merge(classifications_df, left_on="description", right_on="search term")

If you are concerned about efficiency you could set the matching columns as indexes and use join, which is faster (as stated here).

If you want to match parts of strings with your search terms you should use replace, which supports regex. First you need to convert your search terms to regexes that select the whole phrase when finding the term in it, like this:

classifications_df["search term"] = ".*" + classifications_df["search term"] + ".*"

Then replace will replace the whole phrase:

to_replace = classifications_df["search term"].tolist()
replace_with = classifications_df["type"].tolist()

annual_statemenets_df["type"] = annual_statemenets_df["description"].replace(to_replace, replace_with, regex=True)

How to efficiently match strings across two dataframes

Answers (1)

Related Questions