Anurag Tiwary
Anurag Tiwary

Reputation: 143

How to remove a specific string common in multiple lines in a CSV file using shell script?

I have a csv file which contains 65000 lines (Size approximately 28 MB). In each of the lines a certain path in the beginning is given e.g. "c:\abc\bcd\def\123\456". Now let's say the path "c:\abc\bcd\" is common in all the lines and rest of the content is different. I have to remove the common part (In this case "c:\abc\bcd\") from all the lines using a shell script. For example the content of the CSV file is as mentioned.

C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.frag                   0   0   0
C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.vert                   0   0   0
C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.link-link-0.frag       16  24  3
C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.link-link-0.vert       87  116 69
C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.link-link-0.vert.bin   75  95  61
C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.link-link-0            0   0
C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.link-link-6            0   0   0 

In the above example I need the output as below

FILE0.frag                  0   0   0
FILE0.vert                  0   0   0
FILE0.link-link-0.frag      17  25  2
FILE0.link-link-0.vert      85  111 68
FILE0.link-link-0.vert.bin  77  97  60
FILE0.link-link-0               0   0
FILE0.link                  0   0   0

Can any of you please help me out with this?

Upvotes: 0

Views: 1920

Answers (2)

NeronLeVelu
NeronLeVelu

Reputation: 10039

# init variable
MaxPath="$( sed -n 's/,.*//p;1q' YourFile )"
GrepPath="^$( printf "%s" "${MaxPath}" | sed 's#\\#\\\\#g' )"

# search the biggest pattern to remove
while [ ${#MaxPath} -gt 0 ] && [ $( grep -c -v -E "${GrepPath}" YourFile ) -gt 0 ]
 do
   MaxPath="${MaxPath%%?}"
   GrepPath="^$( printf "%s" "${MaxPath}" | sed 's#\\#\\\\#g' )"
 done

# Adapt your file
if [ ${#MaxPath} -gt 0 ]
 then
   sed "s#${GrepPath}##" YourFile
 fi
  • Assuming for the sample that there is no special regex char nor # in MaxPath
  • the grep -c -v -E is not optimized in term of performance (treat whle file each time where it can stop at first miss)

Upvotes: 0

chw21
chw21

Reputation: 8140

You could use sed:

$ cat test.csv 
"c:\abc\bcd\def\123\456", 1, 2
"c:\abc\bcd\def\234\456", 1, 2
"c:\abc\bcd\def\432\456", 3, 4

$ sed -i.bak -e 's/c\:\\abc\\bcd\\//1' test.csv

$ cat test.csv
"def\123\456", 1, 2
"def\234\456", 1, 2
"def\432\456", 3, 4

I am using sed here in this way:

sed -e 's/<SEARCH TERM>/<REPLACE_TERM>/<OCCURANCE>' FILE

where

  • <SEARCH TERM> is what we are looking for (in this case c:\abc\bcd\, but backslashes need to be escaped).
  • <REPLACE TERM> is what we want to replace it with, in this case nothing, and
  • <OCCURANCE> is which occurance of the item we want to replace, in this case the first item in each line.

(-i.bak stands for: Don't output, just edit this file. (but make a backup first))

Updated according to @david-c-rankin comment. He is right, make a backup before editing files in case you make a mistake.

Upvotes: 1

Related Questions