Reputation: 11
I have a CSV file, around 5k lines, with the following example:
apple,tea,salt,fish
apple,oranges,ketchup
...
salad,oreo,lemon
salad,soda,water
I need to extract only first line matching apple or salad and skip other lines where those words occur.
I can do something like this with regex, "apple|salad", but it will extract all the lines where those words are found.
The desired result is:
apple,tea,salt,fish
salad,oreo,lemon
I'm able to use REGEX in a text editor and OpenOffice Calc application.
Upvotes: 0
Views: 425
Reputation: 91430
^(\w+)(.+\R?)(?:\1(?2))+
$1$2
. matches newline
Explanation:
^ # beginning of line
(\w+) # group 1, 1 or more word character, you can use ([^,\r\n]+) if the the first word contains other characters than "word" characters
( # start group 2
.+ # 1 or more any character but newline
\R? # any kind of linebreak, optional
) # end group 2
(?: # non capture group
\1 # backreference to group 1 (i.e. the same word)
(?2) # reuse the pattern of group 2, i.e. (.+\R?)
)+ # end group, may appear 1 or more times
Screenshot (before):
Screenshot (after):
Upvotes: 1
Reputation: 14047
In Notepad++ repeatedly do a regular expression replace of ^(\w+,)(.*)\R\1.*$
with \1\2
. Have "Wrap around" selected.
Explanation:
^ Match beginning of line
(\w+,) Match the leading word plus comma, save to capture group 1
(.*) Match the rest of the line, save to capture group 2
\R Match a line break
\1 Match the same leading word plus comma
.* Match the rest of the line
$ Match the end of the line
The replace string just keeps the first line, the second line is discarded.
Demonstration:
Starting with:
apple,tea1,salt1,fish1
apple,tea2,salt2,fish2
apple,oranges1,ketchup1
apple,oranges2,ketchup2
apple,oranges3,ketchup3
apple,oranges4,ketchup4
salad,oreo1,lemon1
salad,oreo2,lemon2
salad,soda1,water1
salad,soda2,water2
Doing a "Replace all" with the above expressions yields:
apple,tea1,salt1,fish1
apple,oranges1,ketchup1
apple,oranges3,ketchup3
salad,oreo1,lemon1
salad,soda1,water1
Two more clicks on "Replace all" yield:
apple,tea1,salt1,fish1
salad,oreo1,lemon1
Each press of "Replace all" removes approximately half of the unwanted lines.
Upvotes: 0
Reputation: 5688
You could use the great Miller, and run
mlr --nidx --fs "," filter '$1=~"(apple|salad)"' then head -n 1 -g 1 input.csv
to have
apple,tea,salt,fish
salad,oreo,lemon
--nidx
, to set the format, a generic index format--fs ","
, to set the separatorfilter '$1=~"(apple|salad)"'
, to apply the regex filter to the first fieldthen head -n 1 -g 1
, to take the first record, based on the value of the first fieldUpvotes: 1