Reputation: 37
I have a CSV file of 15000 rows. From the list I want to delete the unwanted products/manufacturers. I have a list with manufacturers and the source CSV file.
I found that sed would be appropriate, but I'm stuck on the loop.
while read line
do
unwanted = $
sed "|"$unwanted|d" /home/arno/pixtmp/pixtmp.csv >/home/arno/pixtmp/pix-clean.c$
done < /home/bankey/shopimport/unwanted.txt
Any help is appreciated.
Inputfile:
CONSUMABLES;Inktpatronen voor printer;Inkt voor printer;B0137790;HP;Pakket 2 inktpatronen No339 - Zwart + Papier Goodway - 80 g/m² - A4 - 500 vel;Dit pakket van 2 inktpatronen nr 339 zijn ontworpen voor uw HP printer en leveren afdrukken van kwaliteit.;47.19;6.99;47.19;http://pan8.fotovista.com/dev/8/5/32150358/l_32150358.jpg;in stock;0.2;0.11201;9.99;;C9504EE;0;;
Upvotes: 1
Views: 369
Reputation: 753725
I'd use sed in two steps: first generate a sed script from the unwanted information, then apply that script to the data file. That might be:
unwanted=/home/bankey/shopimport/unwanted.txt
datafile=/home/arno/pixtmp/pixtmp.csv
cleaned=/home/arno/pixtmp/pix-clean.csv
sed 's%.*%/,&,/d%' $unwanted > sed.script
sed -f sed.script $datafile > $cleaned
rm -f sed.script
The first invocation of sed simply replaces the contents of each line describing unwanted records with a sed command that deletes any data line containing that text as a comma-separated field in the middle of the line (the sample input in the question uses semicolons, so use ; in place of the commas for that data). If you have to handle unwanted fields at the beginning or the end too, then you have to work harder. You also have to work harder if there might be embedded slashes, commas, quotes etc. The second invocation of sed applies the script created by the first to the data file, generating the cleaned file.
You can improve it by ensuring the script file name is unique, and by removing the script file if the process is interrupted:
tmp=$(mktemp /tmp/script.XXXXXX)
trap "rm -f $tmp; exit 1" 0 1 2 3 13 15 # EXIT, HUP, INT, QUIT, PIPE, TERM
unwanted=/home/bankey/shopimport/unwanted.txt
datafile=/home/arno/pixtmp/pixtmp.csv
cleaned=/home/arno/pixtmp/pix-clean.csv
sed 's%.*%/,&,/d%' $unwanted > $tmp
sed -f $tmp $datafile > $cleaned
rm -f $tmp
trap 0 # Cancel the exit trap
With GNU sed, but not with Mac OS X (BSD) sed, you could avoid the intermediate file thus:
unwanted=/home/bankey/shopimport/unwanted.txt
datafile=/home/arno/pixtmp/pixtmp.csv
cleaned=/home/arno/pixtmp/pix-clean.csv
sed 's%.*%/,&,/d%' $unwanted |
sed -f - $datafile > $cleaned
This tells the second sed to read its script from standard input. If you are using bash, or another shell that supports process substitution (it is not part of POSIX sh), you could use that instead:
unwanted=/home/bankey/shopimport/unwanted.txt
datafile=/home/arno/pixtmp/pixtmp.csv
cleaned=/home/arno/pixtmp/pix-clean.csv
sed -f <(sed 's%.*%/,&,/d%' $unwanted) $datafile > $cleaned
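As a rough sketch of the "work harder" case mentioned above, the generated commands can anchor the name as a whole field at the start, middle, or end of a line. This assumes semicolon-separated data like the sample in the question, a sed that accepts -E for extended regular expressions (GNU and BSD sed both do), and unwanted names that contain no slashes or regex metacharacters. The file names and sample rows are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample files in a scratch directory.
work=$(mktemp -d)
printf '%s\n' 'HP' 'Canon' > "$work/unwanted.txt"
printf '%s\n' \
    'CONSUMABLES;Ink;B0137790;HP;Pack of 2 cartridges' \
    'CONSUMABLES;Ink;B0137791;Epson;Compatible with HP printers' \
    'Canon;Ink;B0137792' \
    'CONSUMABLES;Ink;B0137793;Brother' > "$work/data.csv"

# Step 1: turn each name into a delete command that matches it only as a
# whole field: preceded by start-of-line or ';', followed by ';' or end-of-line.
sed 's%.*%/(^|;)&(;|$)/d%' "$work/unwanted.txt" > "$work/sed.script"

# Step 2: apply the generated script; -E enables the (a|b) alternation syntax.
result=$(sed -E -f "$work/sed.script" "$work/data.csv")
printf '%s\n' "$result"

rm -rf "$work"
```

With this sample, the Epson row is kept even though its description mentions HP, because HP is not a complete field there; the HP and Canon rows are dropped.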
Upvotes: 1
Reputation: 212248
sed is less suited to this than awk. For example, assuming your input file and your list of undesired terms are space delimited, you could simply do:
awk 'NR==FNR { a[$0]++ } NR != FNR && !a[$1]' undesired input
This will print the file 'input', omitting any line whose first column matches a line in the file undesired.
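For the semicolon-separated sample in the question, the same idea might look like the sketch below. It assumes the manufacturer is always the fifth field (as it is in the one sample row shown) and that no field contains an embedded semicolon; the file names and rows are invented for the example. Because the lookup is an exact string comparison, special characters in the manufacturer names need no escaping.

```shell
#!/bin/sh
# Hypothetical sample files in a scratch directory.
work=$(mktemp -d)
printf '%s\n' 'HP' > "$work/undesired"
printf '%s\n' \
    'CONSUMABLES;Ink;Printer ink;B0137790;HP;Pack of 2 cartridges' \
    'CONSUMABLES;Ink;Printer ink;B0137791;Epson;Works with HP printers' > "$work/input"

# First file: remember every undesired name. Second file: print a line
# only if its fifth field (assumed to be the manufacturer) is not remembered.
result=$(awk -F';' 'NR == FNR { skip[$0] = 1; next } !($5 in skip)' \
    "$work/undesired" "$work/input")
printf '%s\n' "$result"

rm -rf "$work"
```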
Upvotes: 0
Reputation: 200283
You have to make sure that each loop cycle takes the output file from the previous cycle as the input file, otherwise you'll keep overwriting the output file with the content of the original file minus the last unwanted record.
If your sed command supports in-place editing (option -i), you can do this:
cp /home/arno/pixtmp/pixtmp.csv /home/arno/pixtmp/pix-clean.csv
while read line; do
sed -i "/$line/d" /home/arno/pixtmp/pix-clean.csv
done < /home/bankey/shopimport/unwanted.txt
Otherwise you have to handle the temporary file yourself:
cp /home/arno/pixtmp/pixtmp.csv /home/arno/pixtmp/pix-clean.csv
while read line; do
sed "/$line/d" /home/arno/pixtmp/pix-clean.csv >/home/arno/pixtmp/pix-clean.c$
mv -f /home/arno/pixtmp/pix-clean.c$ /home/arno/pixtmp/pix-clean.csv
done < /home/bankey/shopimport/unwanted.txt
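To see why each cycle has to read the previous cycle's output, here is a small sketch of the failing pattern, using made-up file names and rows. Every pass reads the untouched original and overwrites the output, so only the last unwanted name ends up removed.

```shell
#!/bin/sh
# Hypothetical sample files in a scratch directory.
work=$(mktemp -d)
printf '%s\n' 'HP' 'Canon' > "$work/unwanted.txt"
printf '%s\n' 'A1;HP;toner' 'B2;Canon;ink' 'C3;Epson;ink' > "$work/data.csv"

# Failing pattern: the input is always the original file, so the result of
# the HP pass is thrown away when the Canon pass overwrites clean.csv.
while IFS= read -r line; do
    sed "/$line/d" "$work/data.csv" > "$work/clean.csv"
done < "$work/unwanted.txt"

result=$(cat "$work/clean.csv")
printf '%s\n' "$result"

rm -rf "$work"
```

The HP row is still present in the output here; copying the data to the output file first and editing that file on every pass, as in the loops above, avoids this.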
Upvotes: 0