Merging inconsistent data in text files into a single excel spreadsheet

Question

I have a large number of text files with data; each file can be imported into Excel separately. However, while most of the columns are the same across the files, many files have a column or two added or missing, so when I merge all the text files and put them into Excel, many columns of data are shifted.

I can make a 'master list' of all the possible data entries, but I'm not exactly sure how to tell excel to put certain types of data in specific columns.

For instance, if I have two files that look like:

Name Year Food Color
Bob 2018 Cake Blue
Charlie 2017 Figs Red

and

LastName Name Age Year Color Size
Lily James 17 2021 green 0

How would I go about merging them like this in excel:

LastName Name Age Year Food Color Size
na Bob na 2018 Cake Blue na
na Charlie na 2017 Figs Red na
Lily James 17 2021 na green 0

stovfl · Accepted Answer

This solution uses the built-in set() and the standard-library modules csv and io.

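The core of this solution is to collect the column names of all files in a set() object and
to write the rows with csv.DictWriter, which fills columns missing from a row with its restval value (an empty string by default).
The parameter extrasaction='ignore' additionally drops any keys that are not in fieldnames.

The output format is CSV, which MS Excel can open directly.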
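
As a small illustration of how csv.DictWriter treats rows whose keys do not match fieldnames, here is a minimal sketch. The column names and the 'na' placeholder are taken from the question; the choice of restval='na' is an assumption about the desired output, not part of the original answer.

```python
import csv
import io

out = io.StringIO()
writer = csv.DictWriter(
    out,
    fieldnames=["Name", "Year", "Food"],
    restval="na",            # written for any column missing from a row
    extrasaction="ignore",   # silently drop keys that are not in fieldnames
)
writer.writeheader()
writer.writerow({"Name": "Bob", "Year": "2018", "Food": "Cake"})
# This row has no "Food" and carries an extra "Size" key.
writer.writerow({"Name": "James", "Year": "2021", "Size": "0"})

rows = out.getvalue().splitlines()
print(rows)  # ['Name,Year,Food', 'Bob,2018,Cake', 'James,2021,na']
```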

The given data, with columns separated by a single blank (space):

text1 = """Name Year Food Color
Bob 2018 Cake Blue
Charlie 2017 Figs Red
"""
text2 = """LastName Name Age Year Color Size
Lily James 17 2021 green 0
"""

Open the three files (two inputs, one output) and read the headers.
Aggregate all column names, dropping duplicates with a set().
Create a DictReader object for each of the in_* files.

Note: Replace io.StringIO(...) with open(...) to work with real files.

import csv
import io

with io.StringIO(text1) as in_text1, \
     io.StringIO(text2) as in_text2, \
     io.StringIO() as out_csv:

    columns = set()
    reader = []
    for fh in [in_text1, in_text2]:
        # The first line of each file holds that file's column names.
        fieldnames = fh.readline().rstrip().split()
        columns.update(fieldnames)
        reader.append(csv.DictReader(fh, delimiter=' ', fieldnames=fieldnames))

Create a DictWriter object using the normalized column names. Columns missing from a row are written as empty fields (the default restval); the parameter extrasaction='ignore' drops any keys that are not in fieldnames.

Note: The column order is not guaranteed. If you need a defined order, sort the list(columns) to your needs before assigning to fieldnames=.
```
    writer = csv.DictWriter(out_csv, fieldnames=list(columns), extrasaction='ignore')
    writer.writeheader()
```
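
The note on column order could be handled in a couple of ways. The sketch below is only an illustration: the headers list is a hypothetical stand-in for the header lines read from the files, and the first-seen variant assumes Python 3.7 or later, where dict keys keep insertion order.

```python
# Hypothetical stand-in for the header lines read from each file.
headers = [
    ["Name", "Year", "Food", "Color"],
    ["LastName", "Name", "Age", "Year", "Color", "Size"],
]

# Option 1: alphabetical order, deterministic on any Python version.
sorted_columns = sorted({name for header in headers for name in header})

# Option 2: first-seen order; dict.fromkeys drops duplicates while
# keeping the order in which each name first appeared.
columns = list(dict.fromkeys(name for header in headers for name in header))

print(sorted_columns)
print(columns)
```

Either list can then be passed as fieldnames= to the DictWriter.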

Loop over all DictReader objects, read every line, and write it to the target .csv file.

    for dictReader in reader:
        for _dict in dictReader:
            writer.writerow(_dict)

Output:

print(out_csv.getvalue())

Color,LastName,Year,Food,Age,Name,Size
Blue,,2018,Cake,,Bob,
Red,,2017,Figs,,Charlie,
green,Lily,2021,,17,James,0

Tested with Python: 3.4.2
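
To get closer to the layout asked for in the question (a fixed 'master list' of columns and 'na' for missing values), the steps above can be combined into one function. This is a minimal sketch, not part of the original answer: the function name merge_tables and its parameters master and missing are hypothetical, and it assumes values never contain spaces.

```python
import csv
import io


def merge_tables(sources, out, master=None, missing="na"):
    """Merge whitespace-separated tables from file-like objects into one CSV."""
    tables = []
    seen = []  # column names in first-seen order
    for fh in sources:
        header = fh.readline().split()
        for name in header:
            if name not in seen:
                seen.append(name)
        tables.append((header, fh))

    # Use the caller's master list if given, else every column seen.
    columns = list(master) if master is not None else seen
    writer = csv.DictWriter(out, fieldnames=columns,
                            restval=missing,        # fill absent columns
                            extrasaction="ignore")  # drop columns not in master
    writer.writeheader()
    for header, fh in tables:
        for line in fh:
            values = line.split()
            if values:  # skip blank lines
                writer.writerow(dict(zip(header, values)))


text1 = "Name Year Food Color\nBob 2018 Cake Blue\nCharlie 2017 Figs Red\n"
text2 = "LastName Name Age Year Color Size\nLily James 17 2021 green 0\n"

out_csv = io.StringIO()
merge_tables([io.StringIO(text1), io.StringIO(text2)], out_csv,
             master=["LastName", "Name", "Age", "Year", "Food", "Color", "Size"])
result = out_csv.getvalue().splitlines()
print("\n".join(result))
# LastName,Name,Age,Year,Food,Color,Size
# na,Bob,na,2018,Cake,Blue,na
# na,Charlie,na,2017,Figs,Red,na
# Lily,James,17,2021,na,green,0
```

With real files, the same function could be called with handles returned by open(); the csv module documentation recommends opening the output file with newline=''.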
