Auraius
Auraius

Reputation: 13

Merging inconsistent data in text files into a single excel spreadsheet

I have a large number of text files with data; each file can be imported into excel separately. However, while most of the columns are the same between the files, in many files there's a column or two added/missing so when I merge all the text files and put it into excel, many columns of data are shifted.

I can make a 'master list' of all the possible data entries, but I'm not exactly sure how to tell excel to put certain types of data in specific columns.

For instance, if I have two files that look like:

and

How would I go about merging them like this in excel:

Upvotes: 0

Views: 398

Answers (2)

stovfl
stovfl

Reputation: 15533

Question: Merging inconsistent data in text files into a single excel spreadsheet

This solution is using the following build-in and moudules:

The core of this solution is to normalize the columns names using a set() object and
the parameter .DictWriter(..., extrasaction='ignore') to handle the inconsistent columns.

The output format is CSV, which can be read from MS-Excel.


  • The given data, separated by blank

    text1 = """Name Year Food Color
    Bob 2018 Cake Blue
    Charlie 2017 Figs Red
    """
    text2 = """LastName Name Age Year Color Size
    Lily James 17 2021 green 0
    """
    
  • Open three files an get the headers.
    Aggregate all columns names, drop double names using a set().
    Create a DictReader object for the in_* files.

    Note: Replace io.StringIO(... with open(<Path to file>)

    with io.StringIO(text1) as in_text1, \
         io.StringIO(text2) as in_text2, \
         io.StringIO() as out_csv:
    
        columns = set()
        reader = []
        for n, fh in enumerate([in_text1, in_text2]):
            fieldnames = fh.readline().rstrip().split()
            [columns.add(name) for name in fieldnames]
            reader.append(csv.DictReader(fh, delimiter=' ', fieldnames=fieldnames))
    
  • Create a DictWriter object using the normalized column names. The parameter extrasaction='ignore', handle the inconsistent columns.

    Note: The column order is not guaranteed. If you need a defined order, sort the list(columns) to your needs before assigning to fieldnames=.

        writer = csv.DictWriter(out_csv, fieldnames=list(columns), , extrasaction='ignore')
        writer.writeheader()
    
  • Loop all DictReader objects reading all lines and write it to the target .csv file.

        for dictReader in reader:
            for _dict in dictReader:
                writer.writerow(_dict)
    

Output:

print(out_csv.getvalue())

Color,LastName,Year,Food,Age,Name,Size
Blue,,2018,Cake,,Bob,
Red,,2017,Figs,,Charlie,
green,Lily,2021,,17,James,0

Tested with Python: 3.4.2

Upvotes: 1

Skin
Skin

Reputation: 11242

If you were happy to work with the text files directly in Excel ... this will work but may need some refinement from yourself.

I understand it’s probably not what you’re looking for but it provides another option.

Open the Visual Basic editor, add a new module and copy the below code and paste in ...

Public Sub ReadAndMergeTextFiles()
    Dim strSrcFolder As String, strFileName As String, strLine As String, strPath As String, bFirstLine As Boolean
    Dim arrHeaders() As String, lngHeaderIndex As Long, arrFields, i As Long, objDestSheet As Worksheet, bFound As Boolean
    Dim objLastHeader As Range, x As Long, lngLastColumn As Long, lngHeaderCol As Long, arrHeaderCols() As Long
    Dim lngWriteRow As Long

    lngLastColumn = 1
    lngWriteRow = 2

    Application.EnableEvents = False
    Application.ScreenUpdating = False

    ' Change the sheet name being assigned to your destination worksheet name.
    ' Alternatively, display a prompt that asks for the sheet or simply uses the active sheet.
    Set objDestSheet = Worksheets("Result")

    With Application.FileDialog(msoFileDialogFolderPicker)
        .Title = "Select Source Folder"
        .Show

        If .SelectedItems.Count = 1 Then
            objDestSheet.Cells.Clear

            strSrcFolder = .SelectedItems(1)

            strFileName = Dir(strSrcFolder & "\*.txt")

            Do While Len(strFileName) > 0
                strPath = strSrcFolder & "\" & strFileName

                Open strPath For Input As #1

                bFirstLine = True

                Do Until EOF(1)
                    Line Input #1, strLine

                    arrFields = Split(strLine, vbTab, , vbTextCompare)

                    lngHeaderIndex = -1

                    For i = 0 To UBound(arrFields)
                        If bFirstLine Then
                            ' Loop through the header fields already written to the destination worksheet and find a match.
                            For x = 1 To objDestSheet.Columns.Count
                                bFound = False

                                If Trim(objDestSheet.Cells(1, x)) = "" Then Exit For

                                If UCase(objDestSheet.Cells(1, x)) = UCase(arrFields(i)) Then
                                    lngHeaderCol = x
                                    bFound = True
                                    Exit For
                                End If
                            Next

                            If Not bFound Then
                                objDestSheet.Cells(1, lngLastColumn) = arrFields(i)
                                lngHeaderCol = lngLastColumn
                                lngLastColumn = lngLastColumn + 1
                            End If

                            lngHeaderIndex = lngHeaderIndex + 1
                            ReDim Preserve arrHeaderCols(lngHeaderIndex)
                            arrHeaderCols(lngHeaderIndex) = lngHeaderCol
                        Else
                            ' Write out each value into the column found.
                            objDestSheet.Cells(lngWriteRow, arrHeaderCols(i)) = "'" & arrFields(i)
                        End If
                    Next

                    If Not bFirstLine Then
                        lngWriteRow = lngWriteRow + 1
                    End If

                    bFirstLine = False
                Loop

                Close #1

                strFileName = Dir
            Loop

            objDestSheet.Columns.AutoFit
        End If
    End With

    Application.ScreenUpdating = True
    Application.EnableEvents = True
End Sub

... I did some basic testing with the data you provided and it seemed to work. If for some reason it fails over the data you're using and you can't work it out, let me know and I'll put a fix in.

Some points ...

  1. The order of the columns depends on the order of your files and which columns appear first. Of course, that could be enhanced upon but it is what it is for now.

  2. It assumes all files in the one folder and all files end in .txt

  3. The separator within each file is assumed to be a TAB.

Let me know if that helps.

Upvotes: 0

Related Questions