TylerNG
TylerNG

Reputation: 941

combine multiple excel files with similar names

I have a somewhat general question about combining multiple excel files together. Normally, I would use pd.read_excel to read the files then concat to join. However, I have some cases where the field names are not exactly the same but similar. For example,

one sheet would have fields like: Apple, Orange, Size, Id

another sheet would be: Apples, orange, Sizes, #

I have use the rename columns function but with this I have to check and compare every names in each files. I wonder if there's any way to combine them without going through all the field names. Any thought? THANKS!

Upvotes: 0

Views: 318

Answers (1)

Nathan
Nathan

Reputation: 10306

Define what it means for two strings to be the same, then you can do the renaming automatically (you'll also need to determine what the "canonical" form of the string is - the name that you'll actually use in the final dataframe). The problem is pretty general, so you'll have to decide based on the sort of column names that you're willing to consider the same, but one simple thing might be to use a function like this:

def compare_columns(col1: str, col2: str) -> bool:
    return col1.lower() == col2.lower()

Here you'd be saying that any two columns with the same name up to differences in case are considered equal. You'd probably want to define the canonical form for a column to be all lowercase letters.

Actually, now that I think about it, since you'll need a canonical form for a column name anyway, the easiest approach would probably be, instead of comparing names, to just convert all names to canonical form and then merge like usual. In the example here, you'd rename all columns of all your dataframes to be their lowercase versions, then they'll merge correctly.

The hard part will be deciding what transforms to apply to each name to get it into canonical form. Any transformation you do has a risk of combining data that wasn't mean to be (even just changing the case), so you'll need to decide for yourself what's reasonable to change based on what you expect from your column names.

As @ako said, you could also do this with something like Levenstein distance, but I think that will be trickier than just determining a set of transforms to use on each column name. With Levenstein or similar, you'll need to decide which name to rename to, but you'll also have to track all names that map to that name and compute the Levenstein distance between the closest member of that group when deciding if a new name maps to that canonical name (e.g. say that you have "Apple" and "Aple" and "Ale" and are merging names with edit distance of 1 or less. "Apple" and "Aple" should be merged, as should "Aple" and "Ale". "Apple" and "Ale" normally shouldn't be (as their distance is 2), but because they both merge with "Aple", they also merge with each other now).

You could also look into autocorrect to try to convert things like "Aple" to "Apple" without needing "Ale" to also merge in; I'm sure there's some library for doing autocorrect in Python. Additionally, there are NLP tools that will help you if you want to do stemming to try to merge things like "Apples" and "Apple".

But it'll all be tricky. Lowercasing things probably works, though =)

Upvotes: 1

Related Questions