Haifeng Zhang

Reputation: 31925

Why do these two sed commands give different results?

I have a CSV file, example.csv, that contains:

hello,world,wow
this,is,amazing

I want to get the elements of the first column. At first I wrote a sed command like this:

sed -n 's/\([^,]*\),*/\1/p' example.csv

output:

helloworld,wow
thisis,amazing

Then I changed my command to the following and got what I wanted:

sed -n 's/\([^,]*\).*/\1/p' example.csv

output:

hello
this

In command 1 I used a comma (,) and in command 2 I replaced the comma with a dot (.), and it works as expected. Can anyone explain how sed actually processes the first command to produce that output? What's the story behind it? Is it because of the dot (.), or because of the capture group and back-reference?

Upvotes: 1

Views: 73

Answers (5)

raina77ow

Reputation: 106483

In both regexes, ([^,]*) consumes the same part of the string: all the characters preceding the first comma. The difference is in how the remaining parts of those regexes are treated.

In the first one, it's ,* - zero or more commas. All it can consume here is the comma itself; the rest of the line isn't covered by the pattern.

In the second one, it's .* - zero or more of any character. Unsurprisingly, it covers the rest of the line completely, as it has nothing to stop at; any is, well, any.

In both cases, the part of the string covered by the pattern is replaced by the contents of the capturing group (as I said, all the characters before the first comma), and whatever the rest of the regex covered is simply removed. So in the first case only the very first comma is erased; in the second, the comma and the rest of the line.
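
To see this for yourself, here is a small sketch that marks what each pattern actually matches. It swaps the \1 replacement for [&], where & stands for the whole matched text; the variable names are only for illustration.

```shell
line='hello,world,wow'

# & in the replacement is the entire matched text, so the brackets
# show exactly which part of the line each pattern consumes.
first=$(printf '%s\n' "$line" | sed 's/\([^,]*\),*/[&]/')
second=$(printf '%s\n' "$line" | sed 's/\([^,]*\).*/[&]/')

echo "$first"    # [hello,]world,wow
echo "$second"   # [hello,world,wow]
```

Everything inside the brackets is what gets replaced by \1 in the original commands; everything outside is left untouched.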

Upvotes: 3

Tomas Pastircak

Reputation: 2857

The reason is that the pattern matches only the first part of the line, i.e. only the hello, part is replaced. The ,* part takes an arbitrary number of commas, and nothing follows it in the pattern, so nothing after those commas is matched. For example:

hello,,,,,,,,,,,,,,,,,,world

would be replaced with

helloworld
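
A quick way to check that claim, as a sketch using the question's first command on the sample line above (the variable name is just for illustration):

```shell
# The whole run of commas after the first field is consumed by ,* in one go.
many=$(printf 'hello,,,,,,,,,,,,,,,,,,world\n' | sed 's/\([^,]*\),*/\1/')
echo "$many"   # helloworld
```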

A more useful variant would be

sed -n 's/\([^,]*\),*$/\1/p' example.csv

This trims the commas only if they are all at the end of the line, e.g.

hello,,,,,,
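
As a sketch of how the anchored pattern behaves on both kinds of input (the variable names are only illustrative):

```shell
# Commas only at the end of the line: the anchored pattern strips them.
trimmed=$(printf 'hello,,,,,,\n' | sed 's/\([^,]*\),*$/\1/')

# Commas in the middle: only the last field can reach $, so the
# substitution replaces "wow" with itself and the line is unchanged.
unchanged=$(printf 'hello,world,wow\n' | sed 's/\([^,]*\),*$/\1/')

echo "$trimmed"     # hello
echo "$unchanged"   # hello,world,wow
```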

Hope this makes the problem a bit clearer.

Upvotes: 1

Jotne

Reputation: 41460

If you just want the first field, why not use awk?

awk -F, '{print $1}' file
hello
this

Using sed with a back-reference:

sed -nr 's/([^,]*),.*/\1/p' file
hello
this

It seems that to make it work you need the .* so that the match covers the whole line.
The -r option (extended regular expressions in GNU sed) means you don't need to escape the parentheses as \( and \).

Upvotes: 0

Emmet

Reputation: 6421

Can I suggest not using sed?

cut -d, -f1 example.csv

Personally, I'm a huge sed fan, but cut is much more appropriate in this instance.
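
For comparison, here is a sketch that runs the cut, awk and sed approaches from this thread against a copy of the question's sample data. The temporary file and variable names are assumptions for the demo, not part of the original setup.

```shell
# Recreate the sample file from the question in a temporary location.
tmp=$(mktemp)
printf 'hello,world,wow\nthis,is,amazing\n' > "$tmp"

with_cut=$(cut -d, -f1 "$tmp")
with_awk=$(awk -F, '{print $1}' "$tmp")
with_sed=$(sed -n 's/\([^,]*\).*/\1/p' "$tmp")

# All three print the first column: hello, then this.
printf '%s\n' "$with_cut"

rm -f "$tmp"
```

On this input the three give the same result, so the choice mostly comes down to readability.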

Upvotes: 0

frlan

Reputation: 7270

In a regex, the . (dot) is a placeholder for any single character.

Upvotes: 0
