Find duplicate using regex in a txt file opened with Sublime Text 3 editor

Question

I have the following table, written in a txt file.

+----------------+---------------+------------+
| Reference Date | Instrument ID | Entity ID  |
+----------------+---------------+------------+
| 2019-06-28     | 4251675720    | 1000183742 |
+----------------+---------------+------------+
| 2019-06-28     | 4251675720    | 1000183742 |
+----------------+---------------+------------+
| 2019-06-28     | 2113750655    | 100065856  |
+----------------+---------------+------------+
| 2019-06-28     | 3512075270    | 1002923999 |
+----------------+---------------+------------+
| 2019-06-28     | 4251998103    | 1003890261 |
+----------------+---------------+------------+
| 2019-06-28     | 4239113350    | 1004043945 |
+----------------+---------------+------------+
| 2019-06-28     | 8569030255    | 1004043945 |
+----------------+---------------+------------+
| 2019-06-28     | 6692802619    | 1004584989 |
+----------------+---------------+------------+
| 2019-06-28     | 6751615521    | 1005048991 |
+----------------+---------------+------------+
| 2019-06-28     | 1338818134    | 1005076529 |
+----------------+---------------+------------+
| 2019-06-28     | 1903780287    | 1005519781 |
+----------------+---------------+------------+
| 2019-06-28     | 3023132803    | 1005535434 |
+----------------+---------------+------------+
| 2019-06-28     | 3075990149    | 1006443568 |
+----------------+---------------+------------+
| 2019-06-28     | 1821112520    | 1007165898 |
+----------------+---------------+------------+
| 2019-06-28     | 4249904989    | 100753094  |
+----------------+---------------+------------+
| 2019-06-28     | 4230960972    | 1009300504 |
+----------------+---------------+------------+
| 2019-06-28     | 2254190165    | 1010611747 |
+----------------+---------------+------------+

The txt file is open in the Sublime Text 3 editor.

My problem: I don't want duplicates. Since Sublime Text supports Find/Replace with regex, I thought I could use a regex to find the duplicates and then remove them by hand. Specifically, I want to find rows that share the same (Instrument ID, Entity ID) pair. For example, the first two rows have the same pair. Using regex, I want to find every other row that repeats a pair like this, so that I can remove the repeated row by hand.

Keep in mind that in my txt file the syntax is: 1000183742 1006443568 (6 spaces + 1 tab between the columns). So I am looking for a regex that finds repeated pairs, starting from

\d{10}\s{6}\t{1}\d{10} -> 10 digits, followed by 6 spaces, followed by 1 tab, followed by 10 digits
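
To check a pattern like this outside the editor, here is a minimal Python sketch. It assumes Python's re module (Sublime Text uses a different regex engine, so this is only an approximation), writes the spaces literally instead of \s, and uses hypothetical sample lines taken from the table above.

```python
import re

# Variant of the pattern above: 10 digits, 6 literal spaces, 1 tab, 10 digits.
PAIR = re.compile(r"\d{10} {6}\t\d{10}")

# Assumed column separator: 6 spaces followed by 1 tab.
SEP = " " * 6 + "\t"

# Hypothetical sample lines built from values in the table.
sample_lines = [
    "4251675720" + SEP + "1000183742",
    "2113750655" + SEP + "100065856",  # this Entity ID has only 9 digits
]

matches = [bool(PAIR.search(line)) for line in sample_lines]
print(matches)  # [True, False]
```

The second line is not matched because its Entity ID has 9 digits, so a fixed \d{10} skips some rows of the table.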

The fourth bird · Accepted Answer

You can capture the digits in 2 capturing groups and assert that the same pair occurs again further to the right.

\b(\d{10}) {6}\t(\d{10})\b(?=[\s\S]*\b\1 {6}\t\2\b)

\b(\d{10}) Word boundary, capture 10 digits in group 1
 {6}\t Match 6 spaces and a tab
(\d{10})\b Capture 10 digits in group 2 and word boundary
(?= Positive lookahead, assert that what is at the right contains
[\s\S]* Match any char 0+ times
\b\1 {6}\t\2\b Match the exact values of groups 1 and 2 using backreferences, between word boundaries
) Close lookahead
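
As a rough illustration of how the lookahead behaves, here is a small Python sketch. It assumes Python's re module rather than Sublime Text's own engine, writes the tab as \t, places the final word boundary inside the lookahead, and uses a hypothetical four-row sample in the 6-spaces-plus-tab layout from the question.

```python
import re

# Match a pair only if the same pair occurs again somewhere later in the text.
DUPLICATE = re.compile(r"\b(\d{10}) {6}\t(\d{10})\b(?=[\s\S]*\b\1 {6}\t\2\b)")

# Assumed column separator: 6 spaces followed by 1 tab.
SEP = " " * 6 + "\t"

# Hypothetical sample rows taken from the table in the question.
text = "\n".join([
    "2019-06-28 4251675720" + SEP + "1000183742",
    "2019-06-28 4251675720" + SEP + "1000183742",  # same pair as the row above
    "2019-06-28 4239113350" + SEP + "1004043945",
    "2019-06-28 8569030255" + SEP + "1004043945",  # same Entity ID, different Instrument ID
])

duplicates = [m.groups() for m in DUPLICATE.finditer(text)]
print(duplicates)  # [('4251675720', '1000183742')]
```

Only the first row is reported, because it is the only one whose pair appears again further down. The last occurrence of each pair is never matched, so removing the matched rows leaves one copy of every pair.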

You could also switch it around and match the rows that do not have a duplicate further to the right, using a negative lookahead instead. Note that I have used \d{9,10} because not all values are 10 digits.

\b\d{4}-\d{2}-\d{2}[ \t]+\b(\d{9,10}) {6}\t(\d{9,10})\b(?![\s\S]*\b\1 {6}\t\2\b)
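
The negative-lookahead variant can be tried the same way. This is a sketch under a few assumptions: it uses Python's re module, the lookahead is closed so that it mirrors the first pattern, and the three sample rows are hypothetical lines in the layout described in the question.

```python
import re

# Match a full row (date, Instrument ID, Entity ID) only if the same
# ID pair does NOT occur again later in the text.
KEEP = re.compile(
    r"\b\d{4}-\d{2}-\d{2}[ \t]+\b(\d{9,10}) {6}\t(\d{9,10})\b"
    r"(?![\s\S]*\b\1 {6}\t\2\b)"
)

# Assumed column separator: 6 spaces followed by 1 tab.
SEP = " " * 6 + "\t"

# Hypothetical sample rows taken from the table in the question.
text = "\n".join([
    "2019-06-28 4251675720" + SEP + "1000183742",  # repeated in the next row
    "2019-06-28 4251675720" + SEP + "1000183742",
    "2019-06-28 2113750655" + SEP + "100065856",   # 9-digit Entity ID
])

kept = [m.groups() for m in KEEP.finditer(text)]
print(kept)  # [('4251675720', '1000183742'), ('2113750655', '100065856')]
```

Here the first row is skipped because its pair appears again in the second row, while the second and third rows are matched, which gives one entry per distinct pair.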
