Reputation: 23
I have a list of files and I want to combine the files targeting the same serial number. Each file contains thousands of lines, and each line has the following format: date, count, reading.
For example the first file:
"2019-12-23 00:00:00",1123,211685,34650.75,33225.69,...(hundreds of similar numbers)
...(hundreds of similar lines)
"2020-02-23 06:00:00",1372,211685,34651.22,33224.6,...
"2020-02-23 12:00:00",1373,211685,34650.34,33224.74,...
The 2nd file:
"2019-12-17 12:00:00",1101,211685,34649.3,33225.8...
...
"2020-02-15 00:00:00",1339,211685,34651.66,33225.32,...
"2020-02-15 06:00:00",1340,211685,34651.63,33225.19...
The problem is that the missing lines can be either at the beginning or at the end of a file: the initial 100 readings might be missing in one file, while the other file might be missing the latest 50 readings. What would be the best way to merge them? I can think of using "set", but I'm not sure whether it can add missing lines in the middle of a file.
An example of the merged output:
"2019-12-17 12:00:00",1101,211685,34649.3,33225.8...
...
"2019-12-23 00:00:00",1123,211685,34650.75,33225.69,...
...
"2020-02-15 00:00:00",1339,211685,34651.66,33225.32,...
"2020-02-15 06:00:00",1340,211685,34651.63,33225.19...
...
"2020-02-23 06:00:00",1372,211685,34651.22,33224.6,...
"2020-02-23 12:00:00",1373,211685,34650.34,33224.74,...
Upvotes: 1
Views: 532
Reputation: 98921
You can try using:
from datetime import datetime
from pprint import pprint

files = ["merge_01.txt", "merge_02.txt"]

# Collect every line from every file; a set removes the duplicates
# from the overlapping region.
all_lines = []
for file in files:
    with open(file) as f:
        all_lines += [x.strip() for x in f]
all_lines = list(set(all_lines))

# The slice [1:20] skips the opening quote and takes the 19-character
# timestamp, which strptime turns into a sortable datetime.
all_lines.sort(key=lambda line: datetime.strptime(line[1:20], "%Y-%m-%d %H:%M:%S"))
pprint(all_lines)

with open("merge_all.txt", "w") as f:
    for line in all_lines:
        f.write(line + "\n")
Output:
['"2019-12-17 12:00:00",1101,211685,34649.3,33225.8',
'"2019-12-23 00:00:00",1123,211685,34650.75,33225.69',
'"2020-02-15 00:00:00",1339,211685,34651.66,33225.32',
'"2020-02-15 06:00:00",1340,211685,34651.63,33225.19',
'"2020-02-23 06:00:00",1372,211685,34651.22,33224.6',
'"2020-02-23 12:00:00",1373,211685,34650.34,33224.74']
Pandas Solution:
import pandas as pd

files = ["merge_01.txt", "merge_02.txt"]

# Read every line, dropping the quotes around the timestamp.
all_lines = []
for file in files:
    with open(file) as f:
        all_lines += [x.strip().replace('"', "") for x in f]

# Note: the column list assumes five fields per line; extend it to match
# the real files.
df = pd.DataFrame([sub.split(",") for sub in all_lines],
                  columns=["date", "field1", "field2", "field3", "field4"])
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(by='date').drop_duplicates()
df.to_csv('merged.csv', index=False)
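If the real files have hundreds of numeric columns, hard-coding five column names won't scale. Here is a sketch of a variant that lets pandas infer the columns, reusing the same hypothetical filenames; header=None because the files have no header row:

import pandas as pd

files = ["merge_01.txt", "merge_02.txt"]

# pandas strips the quotes around the timestamp while parsing.
df = pd.concat((pd.read_csv(f, header=None) for f in files), ignore_index=True)
df[0] = pd.to_datetime(df[0])  # column 0 holds the timestamp
df = df.drop_duplicates().sort_values(by=0)
df.to_csv("merged.csv", index=False, header=False)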
Upvotes: 1
Reputation: 77347
set doesn't maintain order, but you can sort the result afterwards to get the output file you want. When a date string is written as year-month-day hour:minute:second in UTC, it can be sorted either lowest to highest or highest to lowest as plain text, without any date conversion. Write it the American way, "June 30 2019 12:30 PM MST", and you'd need to convert it to something sortable first.
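For example, plain string sorting already puts the question's timestamps in chronological order (sample values taken from the question; the leading quote is identical on every line, so it doesn't affect the ordering):

# ISO-style "YYYY-MM-DD HH:MM:SS" strings sort chronologically as plain text.
stamps = [
    '"2020-02-23 06:00:00"',
    '"2019-12-17 12:00:00"',
    '"2020-02-15 00:00:00"',
]
print(sorted(stamps))
# ['"2019-12-17 12:00:00"', '"2020-02-15 00:00:00"', '"2020-02-23 06:00:00"']

With that property, merging reduces to a set union plus a plain sort: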
def merge_files(filenames, outfilename):
    rows = set()
    for filename in filenames:
        with open(filename) as fp:
            # Normalize line endings so a missing final newline in one
            # file doesn't produce a near-duplicate row.
            rows.update(line.rstrip("\n") + "\n" for line in fp)
    with open(outfilename, 'w') as fp:
        fp.writelines(sorted(rows))
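A call matching the hypothetical filenames used in the other answer would be:

merge_files(["merge_01.txt", "merge_02.txt"], "merge_all.txt")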
Upvotes: 1