Reputation: 25
I have a 130k-line text file with patent information and I just want to keep the dates (regex "[0-9]{4}-[0-9]{2}-[0-9]{2} ") for subsequent work in Excel. For this purpose I need to keep the line structure intact (including blank lines). My main problem is that I can't seem to find a way to identify and keep multiple occurrences of date information in the same line while deleting all other information.
Original file structure:
US20110228428A1 | US | | 7 | 2010-03-19 | SEAGATE TECHNOLOGY LLC US20120026629A1 | US | | 7 | 2010-07-28 | TDK CORP | US20120127612A1 | US | | EXAMINER | 2010-11-24 | | US20120147501A1 | US | | 2 | 2010-12-09 | SAE MAGNETICS HK LTD,HEADWAY TECHNOLOGIES INC
Desired file structure:
2010-03-19 2010-07-28 2010-11-24 2010-12-09
Thank you for your help!
Upvotes: 1
Views: 729
Reputation: 93026
Search for
.*?(?:([0-9]{4}-[0-9]{2}-[0-9]{2})|$)
And replace with
" $1"
Don't type the quotes; they are only there to show that there is a space before the $1. This will also put a space before the first match in a row.
This regex matches as little as possible (the .*?) before it finds either a date or the end of the row (the $). If a date is found, it is stored in $1 because of the parentheses around it. So as the replacement, just put a space to separate the found dates, followed by the found date from $1. When the end of the row is matched instead of a date, $1 is empty, so the remaining text is replaced by just the space.
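If the file is processed with a script rather than an editor, the same idea can be sketched in Python. This is only an illustration under assumptions: the function names `keep_dates` and `keep_dates_in_text` are made up for this example, the replacement is done with a function so that a missing group is treated as an empty string, and the leading and trailing spaces left by the replacement are stripped.

```python
import re

# Same pattern as above: lazily skip text up to the next date or the end of the line.
PATTERN = re.compile(r".*?(?:([0-9]{4}-[0-9]{2}-[0-9]{2})|$)")


def keep_dates(line: str) -> str:
    """Return only the dates found in `line`, separated by single spaces."""
    # group(1) is None when the match stopped at the end of the line
    # instead of at a date, so fall back to an empty string.
    replaced = PATTERN.sub(lambda m: " " + (m.group(1) or ""), line)
    # Remove the space before the first date and any trailing spaces.
    return replaced.strip()


def keep_dates_in_text(text: str) -> str:
    """Apply keep_dates to every line; blank lines stay blank."""
    return "\n".join(keep_dates(line) for line in text.split("\n"))


if __name__ == "__main__":
    sample = (
        "US20110228428A1 | US | | 7 | 2010-03-19 | SEAGATE TECHNOLOGY LLC "
        "US20120026629A1 | US | | 7 | 2010-07-28 | TDK CORP\n"
        "\n"
        "US20120127612A1 | US | | EXAMINER | 2010-11-24 | |"
    )
    print(keep_dates_in_text(sample))
```

Working line by line keeps the number of lines unchanged, so rows without any date come out as empty lines and the result still lines up with the original file when pasted into Excel.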
Upvotes: 3