Reputation: 209
I was assigned to come up with an algorithm on how to read a template (Excel) and extract the headers/column names from the data itself.
The following must be taken into account:
There can be multiple headers/column names in just one sheet of an Excel file.
Headers can be horizontal AND/OR vertical in nature. This means that there could be a mixture of vertical and horizontal headers in one sheet.
Headers don't necessarily have to be in the very first row of the file. There could be introductions or banner images there.
The system must allow ANY kind of Excel format, so there is no control over the formatting of the cells, the naming convention, etc.
Some headers are alphanumeric, which means they also contain numbers.
Some cells are merged to make room for a specific header.
Any ideas or suggestions?
Upvotes: 1
Views: 1917
Reputation: 343
The solution to this problem involves taking away two of these freedoms. Applying those constraints makes the problem tractable. Most such freedoms come from overcautious thinking. The two freedoms are quoted below:
Headers can be horizontal AND/OR vertical in nature. This means that there could be a mixture of vertical and horizontal headers in one sheet.
Typically, vertical headers are not used in Excel files where headers need to be detected programmatically, because the primary, most common, and sometimes only reason for such detection is to upload or transform the tabular data.
Funny things happen when vertical headers are introduced:
Staying true to the core need for autodetection of headers: once the requirement states that headers can be placed only horizontally, the problem becomes more tractable, though not fully so.
Some cells are merged to make room for a specific header.
Merging cells is poison and anathema to the entire reason for transformation/upload of data. This is a pill I have steadfastly refused to take in my entire career of Excel & SQL jugglery. You may merge all you want for all I care; however, thou shalt not pass into my beloved SQL Server.
For the aforementioned reasons of prejudice and ill will towards all mergers and mergees alike, I'd respectfully suggest that you take this course too.
With those two freedoms taken away, the pseudo-algorithm (solution) is:
Take a sample of, say, r x c cells from the sheet, e.g. 200 rows x 201 columns.
Count the non-empty cells in each row (cells whose contents have a non-zero length), using a built-in function like COUNTA. Keep the count for each row in a data structure.
The type of data in each cell (Number, Date, String) should also be kept in that data structure, so that it can express something like the following:
Row# 22 contains
30 non-empty cells of which
28 are alphanumeric,
1 is a Date and
1 is a Number.
The first row that contains the maximum number of such non-empty cells, with the maximum number of strings among them, is very likely the header row.
Converting all of the above to a specific algorithm in any given language should be a deliciously occupying task for any young developer in their prime.
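For what it's worth, the steps above can be sketched in a few lines of Python. This is only a rough sketch under some assumptions: the sheet has already been read into a list of rows (`grid`) by whatever Excel library you use, empty cells come through as `None` or blank strings, and the names `classify`, `profile_rows` and `guess_header_row` are made up for illustration.

```python
from datetime import date, datetime
from typing import Any, Dict, List, Optional


def classify(value: Any) -> Optional[str]:
    """Return 'string', 'number' or 'date', or None for an empty cell."""
    if value is None:
        return None
    if isinstance(value, str):
        # Whitespace-only strings are treated as empty (an assumption).
        return "string" if value.strip() else None
    if isinstance(value, (date, datetime)):
        return "date"
    if isinstance(value, (int, float)):
        return "number"
    return "string"


def profile_rows(grid: List[List[Any]], max_rows: int = 200,
                 max_cols: int = 201) -> List[Dict[str, int]]:
    """Count non-empty cells per row, broken down by data type."""
    profiles = []
    for row in grid[:max_rows]:
        counts = {"string": 0, "number": 0, "date": 0}
        for value in row[:max_cols]:
            kind = classify(value)
            if kind is not None:
                counts[kind] += 1
        counts["non_empty"] = sum(counts.values())
        profiles.append(counts)
    return profiles


def guess_header_row(grid: List[List[Any]]) -> Optional[int]:
    """Return the 0-based index of the most header-like row, or None."""
    profiles = profile_rows(grid)
    if not profiles:
        return None
    # Most non-empty cells first, then most strings, then the earliest row.
    best = max(
        range(len(profiles)),
        key=lambda i: (profiles[i]["non_empty"], profiles[i]["string"], -i),
    )
    return best if profiles[best]["non_empty"] > 0 else None


sample = [
    ["Quarterly report", None, None],
    [None, None, None],
    ["ID", "Name", "Joined"],
    [1, "Ann", date(2020, 1, 1)],
    [2, "Bob", date(2021, 1, 1)],
]
print(guess_header_row(sample))
```

The tie-break on the string count is what separates the header row from the data rows below it, which have just as many non-empty cells but fewer strings.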
Upvotes: 0
Reputation: 86600
(I know nothing about Apache, but I know a bit about how Excel Interop works.)
If the sheets to be detected are yours, I'd recommend NAMING those header cells. To name a cell in Excel, use the field at the top left of the screen, where the cell coordinates normally appear (like "A1" or "B2"). Type a name there, and you will be able to identify that cell from code by its name ('Worksheet.Range("Name")' is how you get those cells via code).
To manage names, go to "Insert - Names" or "Formulas - Name manager", depending on the version of Excel.
(Personally, I never work with sheets via code without naming the headers. I then use "Offset" to get the data cells corresponding to those headers. This lets me freely edit the sheet later without breaking the code.)
If the sheets aren't yours, you'll need to find the extents of the data (last row and last column). Then look for the first row that has every column filled, with none blank. That's a probable horizontal header. Likewise, look for the first column that has every row filled. That's a probable vertical header.
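Here is a rough Python sketch of that check, just to make the idea concrete. It assumes the sheet has already been loaded into a list of rows (`grid`) with `None` or blank strings for empty cells; the helper names (`is_blank`, `data_extents`, `first_full_row`, `first_full_col`) are made up for the example and are not from any Excel library.

```python
from typing import Any, List, Optional, Tuple


def is_blank(value: Any) -> bool:
    """A cell is blank if it is None or a whitespace-only string."""
    return value is None or (isinstance(value, str) and not value.strip())


def cell(grid: List[List[Any]], r: int, c: int) -> Any:
    """Read a cell, treating positions past the end of a short row as empty."""
    row = grid[r]
    return row[c] if c < len(row) else None


def data_extents(grid: List[List[Any]]) -> Optional[Tuple[int, int]]:
    """Return (last_row, last_col) of the used area, 0-based, or None."""
    last_row = last_col = -1
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if not is_blank(value):
                last_row = max(last_row, r)
                last_col = max(last_col, c)
    return None if last_row < 0 else (last_row, last_col)


def first_full_row(grid: List[List[Any]]) -> Optional[int]:
    """First row with every column filled: a probable horizontal header."""
    extents = data_extents(grid)
    if extents is None:
        return None
    last_row, last_col = extents
    for r in range(last_row + 1):
        if all(not is_blank(cell(grid, r, c)) for c in range(last_col + 1)):
            return r
    return None


def first_full_col(grid: List[List[Any]]) -> Optional[int]:
    """First column with every row filled: a probable vertical header."""
    extents = data_extents(grid)
    if extents is None:
        return None
    last_row, last_col = extents
    for c in range(last_col + 1):
        if all(not is_blank(cell(grid, r, c)) for r in range(last_row + 1)):
            return c
    return None


sheet = [
    ["Report", None, None],
    ["Region", "Q1", "Q2"],
    ["North", 10, 20],
    ["South", 30, 40],
]
print(data_extents(sheet), first_full_row(sheet), first_full_col(sheet))
```

Note that a banner or a blank row above the table means no column will have every row filled, so in practice the column check probably needs to run on a sub-range rather than the whole sheet.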
You could also search for completely blank rows and/or columns to find headers that come AFTER some data, in case a sheet contains multiple horizontal or vertical headers.
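As a minimal sketch of the blank-row idea (same assumptions as before: the sheet is already a list of rows called `grid`, empty cells are `None` or blank strings, and `split_on_blank_rows` is a made-up name), you can cut the sheet into blocks and then look for a header at the top of each block:

```python
from typing import Any, List, Tuple


def is_blank(value: Any) -> bool:
    """A cell is blank if it is None or a whitespace-only string."""
    return value is None or (isinstance(value, str) and not value.strip())


def split_on_blank_rows(grid: List[List[Any]]) -> List[Tuple[int, int]]:
    """Return (first_row, last_row) pairs, 0-based and inclusive, for each
    run of non-blank rows. Completely blank rows act as separators."""
    blocks = []
    start = None
    for r, row in enumerate(grid):
        if all(is_blank(v) for v in row):
            if start is not None:
                blocks.append((start, r - 1))
                start = None
        elif start is None:
            start = r
    if start is not None:
        blocks.append((start, len(grid) - 1))
    return blocks


two_tables = [
    ["A", "B"],
    [1, 2],
    [None, None],
    ["", " "],
    ["C", "D"],
    [3, 4],
]
print(split_on_blank_rows(two_tables))
```

The same thing can be done for columns by transposing the grid first. The first row of each block is then a candidate header for that block.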
You could use some formatting properties of those cells (Range.Interior or Range.Font, for example) to identify whether they are headers (headers usually have a different format, color, borders and so on).
If you're sure there are no numeric headers, I mean, all headers contain text, check the type of data in the cells. If all are strings, the header probability increases.
Even so, that's a tricky thing to do. If the sheets don't follow some pattern, once in a while one of them can deceive your code and produce false results. I'd recommend, if allowed, adding a human verification step to confirm the results after the process is done.
Upvotes: 4