nmh
nmh

Reputation: 501

Delete line from csv from specific field if it contains below x number of strings/words

If I have a csv similar to the following:

1999, random strings go here, £50.00, 983903893
1957, Another lacklustre line, £99.00, 3983093
1987, Adventure UK, £83.83, 39939
1945, North Wales is the Adrenaline Capital of Europe, £78.99, 83983

How can I delete line where field 2 has 3 or less words. So the output would be:

1999, random strings go here, £50.00, 983903893
1945, North Wales is the Adrenaline Capital of Europe, £78.99, 83983

I'm tempted to say I would like to use sed because it can change the file in place witout the need to create a new file.

I know how to refer to the 2nd field and remove individual words or symbols with sed from a csv. For example I could use this to remove a question mark:

sed -ri ':b s/^([^,]*,[^,]*)\?/\1 /g; t b'

But how do I tell said that I want to remove the entire line if there are three or less words? This is the bit i'm struggling with. Thanks.

EDIT: Here is an actual snippet of the file, please refer to this rather than my original example- it's in the same format but different data which seems to have affected how the suggested sed line in Jean-François Fabre answer processes the data:

142106729748,Rocky Horror Book,http://www.ebay.co.uk/itm/Rocky-Horror-Book-/142106729748,0.99
162189532196,Total Film Issue 10,http://www.ebay.co.uk/itm/Total-Film-Issue-10-/162189532196,0.75
162189528365,Total Film Issue 9,http://www.ebay.co.uk/itm/Total-Film-Issue-9-/162189528365,0.99
172328113931,Captain America 163 Silver Age,http://www.ebay.co.uk/itm/Captain-America-163-Silver-Age-/172328113931,2.5
232069020935,Football Picture Story Monthly,http://www.ebay.co.uk/itm/Football-Picture-Story-Monthly-/232069020935,0.25
262606117082,The geographical Tradition ,http://www.ebay.co.uk/itm/geographical-Tradition-/262606117082,10.0
401182170339,Naruto Official Fanbook,http://www.ebay.co.uk/itm/Naruto-Official-Fanbook-/401182170339,3.0

Upvotes: 1

Views: 93

Answers (4)

Jean-François Fabre
Jean-François Fabre

Reputation: 140287

sed does it (barely, but does it)

sed -r '/^[^,]+, (\w+[, ]){4,}/!d' text.txt

small explanation:

  • first field is skipped
  • {4,} matches 4 or more words separated by space or comma (next field)
  • the !d command does not delete those (delete non-matching ones)
  • needs -r option or some stuff won't work like \w

result:

1999, random strings go here, £50.00, 983903893
1945, North Wales is the Adrenaline Capital of Europe, £78.99, 83983

(well I had to fix the pound sign by hand actually :))

Edit: better just in case a TAB finds its way in the file (thanks potong)

 sed -r '/^[^,]*,([[:space:]]+[^[:space:],]+){4}/!d'

Upvotes: 3

Ed Morton
Ed Morton

Reputation: 204558

sed is for simple substitutions on individual lines, that is all. For anything else you should be using awk:

$ awk -F' *, *' 'split($2,t,/ */)>3' file
1999, random strings go here, £50.00, 983903893
1945, North Wales is the Adrenaline Capital of Europe, £78.99, 83983

Upvotes: 0

heemayl
heemayl

Reputation: 42127

With awk:

awk -F ', ' '{split($2, arr, " "); if(length(arr)>=4) print}' file.txt
  • split($2, arr, " ") creates array arr by splitting the second field on space

  • if(length(arr)>=4) print prints the record only if the length of the array is >=4

Example:

% cat file.txt                                                          
1999, random strings go here, £50.00, 983903893
1957, Another lacklustre line, £99.00, 3983093
1987, Adventure UK, £83.83, 39939
1945, North Wales is the Adrenaline Capital of Europe, £78.99, 83983

% awk -F ', ' '{split($2, arr, " "); if(length(arr)>=4) print}' file.txt
1999, random strings go here, £50.00, 983903893
1945, North Wales is the Adrenaline Capital of Europe, £78.99, 83983

Upvotes: 1

choroba
choroba

Reputation: 242333

Perl solution:

perl -waF, -i~ -ne 'print if 3 < split " ", $F[1]' -- text.txt
  • -w turns on warnings
  • -n reads the input line by line
  • -a splits each input line into the @F array
  • -F tells Perl how to split it, in this case it uses a comma
  • -i modifies the file in place, ~ will be used as a suffix for the backup

split in scalar context returns the number of fields.

Upvotes: 2

Related Questions