Reputation: 13
I have a large number of text files with data; each file can be imported into excel separately. However, while most of the columns are the same between the files, in many files there's a column or two added/missing so when I merge all the text files and put it into excel, many columns of data are shifted.
I can make a 'master list' of all the possible data entries, but I'm not exactly sure how to tell excel to put certain types of data in specific columns.
For instance, if I have two files that look like:
and
How would I go about merging them like this in excel:
Upvotes: 0
Views: 398
Reputation: 15533
Question: Merging inconsistent data in text files into a single excel spreadsheet
This solution is using the following build-in
and moudules
:
The core of this solution is to normalize the columns names using a set()
object and
the parameter .DictWriter(..., extrasaction='ignore')
to handle the inconsistent columns.
The output format is CSV
, which can be read from MS-Excel.
The given data, separated by blank
text1 = """Name Year Food Color
Bob 2018 Cake Blue
Charlie 2017 Figs Red
"""
text2 = """LastName Name Age Year Color Size
Lily James 17 2021 green 0
"""
Open three files an get the headers.
Aggregate all columns names, drop double names using a set()
.
Create a DictReader
object for the in_*
files.
Note: Replace
io.StringIO(...
withopen(<Path to file>)
with io.StringIO(text1) as in_text1, \
io.StringIO(text2) as in_text2, \
io.StringIO() as out_csv:
columns = set()
reader = []
for n, fh in enumerate([in_text1, in_text2]):
fieldnames = fh.readline().rstrip().split()
[columns.add(name) for name in fieldnames]
reader.append(csv.DictReader(fh, delimiter=' ', fieldnames=fieldnames))
Create a DictWriter
object using the normalized column names.
The parameter extrasaction='ignore'
, handle the inconsistent columns.
Note: The column order is not guaranteed. If you need a defined order, sort the
list(columns)
to your needs before assigning tofieldnames=
.
writer = csv.DictWriter(out_csv, fieldnames=list(columns), , extrasaction='ignore')
writer.writeheader()
Loop all DictReader
objects reading all lines and write it to the target .csv
file.
for dictReader in reader:
for _dict in dictReader:
writer.writerow(_dict)
Output:
print(out_csv.getvalue()) Color,LastName,Year,Food,Age,Name,Size Blue,,2018,Cake,,Bob, Red,,2017,Figs,,Charlie, green,Lily,2021,,17,James,0
Tested with Python: 3.4.2
Upvotes: 1
Reputation: 11242
If you were happy to work with the text files directly in Excel ... this will work but may need some refinement from yourself.
I understand it’s probably not what you’re looking for but it provides another option.
Open the Visual Basic editor, add a new module and copy the below code and paste in ...
Public Sub ReadAndMergeTextFiles()
Dim strSrcFolder As String, strFileName As String, strLine As String, strPath As String, bFirstLine As Boolean
Dim arrHeaders() As String, lngHeaderIndex As Long, arrFields, i As Long, objDestSheet As Worksheet, bFound As Boolean
Dim objLastHeader As Range, x As Long, lngLastColumn As Long, lngHeaderCol As Long, arrHeaderCols() As Long
Dim lngWriteRow As Long
lngLastColumn = 1
lngWriteRow = 2
Application.EnableEvents = False
Application.ScreenUpdating = False
' Change the sheet name being assigned to your destination worksheet name.
' Alternatively, display a prompt that asks for the sheet or simply uses the active sheet.
Set objDestSheet = Worksheets("Result")
With Application.FileDialog(msoFileDialogFolderPicker)
.Title = "Select Source Folder"
.Show
If .SelectedItems.Count = 1 Then
objDestSheet.Cells.Clear
strSrcFolder = .SelectedItems(1)
strFileName = Dir(strSrcFolder & "\*.txt")
Do While Len(strFileName) > 0
strPath = strSrcFolder & "\" & strFileName
Open strPath For Input As #1
bFirstLine = True
Do Until EOF(1)
Line Input #1, strLine
arrFields = Split(strLine, vbTab, , vbTextCompare)
lngHeaderIndex = -1
For i = 0 To UBound(arrFields)
If bFirstLine Then
' Loop through the header fields already written to the destination worksheet and find a match.
For x = 1 To objDestSheet.Columns.Count
bFound = False
If Trim(objDestSheet.Cells(1, x)) = "" Then Exit For
If UCase(objDestSheet.Cells(1, x)) = UCase(arrFields(i)) Then
lngHeaderCol = x
bFound = True
Exit For
End If
Next
If Not bFound Then
objDestSheet.Cells(1, lngLastColumn) = arrFields(i)
lngHeaderCol = lngLastColumn
lngLastColumn = lngLastColumn + 1
End If
lngHeaderIndex = lngHeaderIndex + 1
ReDim Preserve arrHeaderCols(lngHeaderIndex)
arrHeaderCols(lngHeaderIndex) = lngHeaderCol
Else
' Write out each value into the column found.
objDestSheet.Cells(lngWriteRow, arrHeaderCols(i)) = "'" & arrFields(i)
End If
Next
If Not bFirstLine Then
lngWriteRow = lngWriteRow + 1
End If
bFirstLine = False
Loop
Close #1
strFileName = Dir
Loop
objDestSheet.Columns.AutoFit
End If
End With
Application.ScreenUpdating = True
Application.EnableEvents = True
End Sub
... I did some basic testing with the data you provided and it seemed to work. If for some reason it fails over the data you're using and you can't work it out, let me know and I'll put a fix in.
Some points ...
The order of the columns depends on the order of your files and which columns appear first. Of course, that could be enhanced upon but it is what it is for now.
It assumes all files in the one folder and all files end in .txt
The separator within each file is assumed to be a TAB.
Let me know if that helps.
Upvotes: 0