Reputation: 31
I have a file with 225000 rows that contains a bunch of similar lines. I'd like to remove the similar lines, keeping only the first line of each "type". An example is below.
I'd like a file that looks like this:
./ACT_HERE_REPORT_MEMO_APPROVED_20180510_083000.log.gz
./ACT_HERE_REPORT_MEMO_APPROVED_20180512_083000.log.gz
./ACT_HERE_REPORT_MEMO_APPROVED_20180513_083000.log.gz
./ACT_HERE_REPORT_MEMO_APPROVED_20180515_083000.log.gz
./ACT_HERE_SOMETHING_MEMO_APPROVED_20180326.xls
./ACT_HERE_SOMETHING_MEMO_APPROVED_20180327.xls
./ACT_HERE_SOMETHING_MEMO_APPROVED_20180328.xls
./ACT_HERE_SOMETHING_MEMO_APPROVED_20180329.xls
./ACT_HERE_SOMETHING_MEMO_APPROVED_20180331.xls
./Archive/20150919-084501.SOMETHING
./Archive/20150922-084501.SOMETHING
./Archive/20150923-084500.SOMETHING
./Archive/20150924-084500.SOMETHING
./TEST/TEST.20170310.20170310-181017.txt.gz
./TEST/TEST.20170310.20170310-201023.txt.gz
./TEST/TEST.20170313.20170313-011035.txt.gz
./TEST/TEST.20170313.20170313-024006.txt.gz
./TEST/TEST.20170313.20170313-041018.txt.gz
./TEST/TEST.20180402-011024.log.gz
./TEST/TEST.20180402-011200.log.gz
./TEST/TEST.20180402-061113.log.gz
./TEST/TEST.20180402-081013.log.gz
./TEST/TEST.20180402-101012.log.gz
To end up like this:
./ACT_HERE_REPORT_MEMO_APPROVED_20180510_083000.log.gz
./ACT_HERE_SOMETHING_MEMO_APPROVED_20180326.xls
./Archive/20150919-084501.SOMETHING
./TEST/TEST.20170310.20170310-181017.txt.gz
./TEST/TEST.20180402-011024.log.gz
Upvotes: 3
Views: 49
Reputation: 91518
Find: ((^.+?)[-_.\d]+(\..+\R))(?:\2[-_.\d]+\3)+
Replace with: $1
Leave ". matches newline" unchecked, so that . stops at the end of each line.
Explanation:
( # start group 1
( # start group 2
^ # beginning of line
.+? # 1 or more any character but newline, not greedy
) # end group 2
[-_.\d]+ # 1 or more hyphen, underscore, dot or digit
( # start group 3
\. # a dot
.+ # 1 or more any character but newline
\R # any kind of linebreak
) # end group 3
) # end group 1
(?: # non capture group
\2 # backreference to group 2
[-_.\d]+ # 1 or more hyphen, underscore, dot or digit
\3 # backreference to group 3
)+ # end group, must appear 1 or more times
Result for given example:
./ACT_HERE_REPORT_MEMO_APPROVED_20180510_083000.log.gz
./ACT_HERE_SOMETHING_MEMO_APPROVED_20180326.xls
./Archive/20150919-084501.SOMETHING
./TEST/TEST.20170310.20170310-181017.txt.gz
./TEST/TEST.20180402-011024.log.gz
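For anyone who would rather script this than run it in an editor, here is a minimal Python sketch of the same replacement. It rests on a few assumptions: Python's `re` module has no `\R`, so `\r?\n` is swapped in for it; `re.MULTILINE` is needed so `^` matches at every line start; and the names `PATTERN`, `collapse_similar` and `sample` are made up for illustration. Like the regex above, it only collapses similar lines that are adjacent, so the input should be sorted.

```python
import re

# Same pattern as above, adapted for Python's re module:
# \R is not supported there, so \r?\n stands in for "any kind of linebreak".
# re.MULTILINE makes ^ match at the start of every line; re.DOTALL is left
# off so that "." does not cross line ends.
PATTERN = re.compile(
    r"((^.+?)[-_.\d]+(\..+\r?\n))(?:\2[-_.\d]+\3)+",
    re.MULTILINE,
)


def collapse_similar(text: str) -> str:
    """Keep only the first line of each run of similar adjacent lines."""
    # Group 3 ends with a linebreak, so make sure the last line has one.
    if text and not text.endswith("\n"):
        text += "\n"
    return PATTERN.sub(r"\1", text)


# A shortened version of the example from the question.
sample = (
    "./ACT_HERE_REPORT_MEMO_APPROVED_20180510_083000.log.gz\n"
    "./ACT_HERE_REPORT_MEMO_APPROVED_20180512_083000.log.gz\n"
    "./Archive/20150919-084501.SOMETHING\n"
    "./Archive/20150922-084501.SOMETHING\n"
    "./TEST/TEST.20170310.20170310-181017.txt.gz\n"
    "./TEST/TEST.20170310.20170310-201023.txt.gz\n"
    "./TEST/TEST.20180402-011024.log.gz\n"
    "./TEST/TEST.20180402-011200.log.gz\n"
)

print(collapse_similar(sample), end="")
```

On this shortened sample the first line of each of the four groups is kept. How well the pattern performs on the full 225000-row file is not something this sketch measures.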
Upvotes: 5