Angel

Reputation: 2875

What is the best way to integrate Excel files that have different sheets and different formats in Python?

I have multiple Excel files with different sheets in each file. These files have been made by people, so each one has a different format, a different number of columns, and a different structure for representing the data. For example, in one sheet the dataframe/table starts at the 8th row, second column; in another it starts at row 122, etc.

What I want to retrieve is common to all of these Excel files: variable names and their associated information.

However, I don't know how I could retrieve all this information without parsing each individual file by hand, which is not an option because there are a lot of these files, with lots of sheets in each one.

I have been thinking about using regex as well as edit distance between words, but I don't know if that is the best option.

Any help is appreciated.

Upvotes: 0

Views: 65

Answers (1)

Mateo Torres

Reputation: 1625

I will divide my answer into what I think you can do now, and suggestions for the future (if feasible).

An attempt to "solve" the problem you have with existing files.

Without any regularity in your input files (such as, at a minimum, common column names), I think what you're describing is among the best solutions. Having said that, perhaps a "fancier" similarity metric between column names would be more useful than regular expressions.

If you believe that there will be some regularity in the column names, you could look at string distances such as the Hamming distance or the Levenshtein distance, and use a threshold on the distance that works for you. As an example, let's say that you have a function d(a: str, b: str) -> float that calculates a distance between column names; you could do something like this:


# this variable is a small sample of "expected" column names
plausible_columns = [
    'interesting column',
    'interesting',
    'interesting-column',
    'interesting_column',
]

for f in excel_files:
    # process the file until you find columns;
    # I'm assuming you can put the column names into
    # a variable `columns` here
    for c in columns:
        for p in plausible_columns:
            if d(c, p) < threshold:
                # do something to process the column:
                # add it to a pandas DataFrame,
                # calculate the mean, etc.
                ...

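The snippet above leaves two pieces open: how to get `columns` out of a sheet whose table can start anywhere, and what `d` actually is. As one possible sketch (not the only way to do it), here is an implementation using pandas and the standard library's difflib; the distance function, the default threshold of 0.3, and the file name are all illustrative assumptions you would adapt to your data:

import difflib
import pandas as pd

# illustrative distance: 1 minus difflib's similarity ratio, in [0, 1];
# a proper Levenshtein implementation would work just as well here
def d(a: str, b: str) -> float:
    return 1.0 - difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_header_row(df, plausible_columns, threshold=0.3):
    """Return the index of the row whose cells best match the expected
    column names, or None if no row scores below the threshold."""
    best_row, best_score = None, threshold
    for idx, row in df.iterrows():
        cells = [str(v) for v in row if pd.notna(v)]
        if not cells:
            continue
        # score each expected name by its closest cell in this row
        scores = [min(d(cell, p) for cell in cells) for p in plausible_columns]
        avg = sum(scores) / len(scores)
        if avg < best_score:
            best_row, best_score = idx, avg
    return best_row

# usage sketch: read every sheet with no header row, then locate the table
# ('report.xlsx' is a placeholder name)
for name, sheet in pd.read_excel('report.xlsx', sheet_name=None, header=None).items():
    header_idx = find_header_row(sheet, plausible_columns)
    if header_idx is not None:
        table = sheet.iloc[header_idx + 1:].copy()
        table.columns = sheet.iloc[header_idx]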
If the data itself can tell you whether you should process it (for example, by following a particular distribution, or falling within a particular range), you can use such features to decide whether to use a column. Even better, you can combine several of these characteristics to make a finer decision.
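For instance, if you expect a column of percentages, a check along these lines could filter out columns that merely have a similar name (the 80% numeric share and the 0-100 range are assumptions you would tune to your data):

import pandas as pd

# hedged heuristic: keep a column only if its values look like the
# measurements we expect; the cutoffs below are illustrative
def looks_like_target(col: pd.Series) -> bool:
    values = pd.to_numeric(col, errors='coerce').dropna()
    if len(col) == 0 or len(values) < 0.8 * len(col):
        return False  # mostly non-numeric, probably not our column
    return values.between(0, 100).mean() > 0.95  # mostly in expected range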

Having said this, I don't think a fully automated solution exists without inspecting some of the data manually and studying the distribution of the data, the variability in the column names, etc.

For the future

Even with fancy methods for calculating features and doing some analysis on the data you have right now, I think it would be impossible to guarantee that you will always get the data you need (by the very nature of the problem). A reasonable way to solve this, in my opinion (and if it is feasible in whatever context you're working in), is to impose a stricter format at the data-generation end; I suppose this is a manual process, with people entering data into Excel directly. I would argue that the best solution is to remove the problem at its root: create a unified form or Excel sheet template and distribute it to the people who will fill in the data, so that the files can be ingested automatically, minimizing the risk of errors.

Upvotes: 1
