Reputation: 587
Okay: first of all, I did try searching for answers before posting this. I'm not saying that there are none, just that I was unable to find one. In my defense: I've been forced to switch from scripting and other interesting work to working as a grocery shop assistant, so my brain has probably rotted away.
What I'm trying to do is the following:
I have a file which contains, let's say, descriptions of goods, including EAN codes. There are no proper delimiters; I only have "column lengths". I know that the EAN code column starts at position 134 and ends at position 147.
I tried using this:
cat $processedFile | sed "s/^(.{134})/\1;/g" | sed "s/^(.{148})/\1;/g" >> $outFile
My problem is this:
The people working with the software that generates the files are extremely mistrustful of computers and don't really care which characters they use when naming goods. Therefore some items contain a slash or backslash as part of their name, which is another column in the file, incidentally right in front of the EAN.
Some columns therefore remain unprocessed.
Example of the input file:
00110363 201406170014469 35.0 1 35.000 0.2360 0.3720 T SnackName001 chees-onion8588004269750 0291410610363 0 0.00.000
00110363 201406170013935 24.0 1 24.000 0.2780 0.4320 T SnackName002 blah-blah-b78588000510535 0291410610363 0 0.00.000
00110363 201406170013936 24.0 1 24.000 0.2780 0.4320 T SnackName003 blah-blah-b78588000510511 0291410610363 0 0.00.000
00110363 201406170016056 18.0 1 18.000 0.2033 0.3520 T SnackName004 blah-blah 3838700069938 0291410610363 0 0.00.000
00110363 201406170013808 10.0 1 10.000 0.5794 0.9220 T SnackName005 blah-blah-b8588000467617 0291410610363 0 0.00.000
00110363 201406170009326 8.0 5 40.000 0.7500 1.2120 T Sugar powd. brandN\ED1kg 8594003782411 0291410610363 0 0.00.000
The last line is an example of what causes me headaches.
Any hints? Or... would it be better to use something entirely different from sed?
I need to make sure that the scripts are idiot-proof, since I expect that people who have difficulty finding the power button on the chassis will be working with them later on.
EDIT: I apologize, I didn't realize that EANs aren't so easily distinguishable in my example ^_^; Thank you, condorwasabi.
The EAN code is the integer following the product name. To be more precise, in the line
00110363 201406170014469 35.0 1 35.000 0.2360 0.3720 T SnackName001 chees-onion8588004269750 0291410610363 0 0.00.000
the 8588004269750 is the EAN part. And yes, in the file, if the name is too long, there is no space, colon, semicolon or any other character to mark the end of the name and the beginning of the EAN code.
Upvotes: 0
Views: 656
Reputation: 437109
I suggest using awk.
I'm not fully clear on the requirements, but this may get you started:
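awk '{
  # Work on a copy so that the original line is printed unchanged.
  cleanLine = $0
  # Remove escape sequences such as \ED, which shift the column positions.
  gsub(/\\[A-Z]{2}/, "", cleanLine)
  # The EAN occupies 13 characters starting at column 134.
  EAN = substr(cleanLine, 134, 13)
  # Wrap the EAN in semicolons in the original line.
  sub(EAN, ";" EAN ";")
  print
}' file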
The above removes \ED, i.e. any sequence of \ followed by two uppercase letters. You also mention / in your question; I'm not sure what patterns to look for there, but the resulting regex must replace /\\[A-Z]{2}/ above.
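To see what that substitution does on its own, here is a minimal sketch. The function name strip_escapes and the shortened sample line are made up for illustration, and [A-Z][A-Z] is used instead of [A-Z]{2} only because some awk builds do not support interval expressions:

```shell
#!/bin/sh
# Hypothetical helper: delete backslash sequences such as \ED from stdin.
# [A-Z][A-Z] matches the same text as [A-Z]{2}.
strip_escapes() {
  awk '{ gsub(/\\[A-Z][A-Z]/, ""); print }'
}

# Shortened, made-up sample based on the last line of the question.
sample='Sugar powd. brandN\ED1kg 8594003782411'
cleaned=$(printf '%s\n' "$sample" | strip_escapes)
printf '%s\n' "$cleaned"
```

The cleaned line is three characters shorter than the original, which is why the column positions have to be taken from the cleaned copy rather than from the raw line.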
If each such sequence should instead count as a fixed number of characters, replace the "" argument to gsub() with a string composed of that number of dummy characters, e.g., "x".
Upvotes: 1
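A rough sketch of the dummy-character variant, under the assumption that every \XX sequence stands for exactly one character of the fixed-width layout. The function name mark_ean, the shortened sample lines, and the column position 12 are made up for illustration; with the real file the arguments would be 134 and 13:

```shell
#!/bin/sh
# Hypothetical helper: wrap the field at column $1 (length $2) in semicolons.
# Assumption: each \XX sequence represents one character, so it is replaced
# by a single placeholder "x" before the column position is applied.
mark_ean() {
  awk -v start="$1" -v len="$2" '{
    cleanLine = $0
    gsub(/\\[A-Z][A-Z]/, "x", cleanLine)   # one placeholder per sequence
    EAN = substr(cleanLine, start, len)    # field taken from the cleaned copy
    # EAN is used as a regex here; that is safe only while it is all digits.
    # sub() replaces the first occurrence in the original line.
    sub(EAN, ";" EAN ";")
    print
  }'
}

# Shortened sample lines: an 11-column name field, then a 13-digit EAN.
printf '%s\n' 'brandN\ED1kg 8594003782411 0291410610363' | mark_ean 12 13
printf '%s\n' 'chees-onion8588004269750 0291410610363' | mark_ean 12 13
```

Because sub() replaces the first match in the line, this would go wrong if the same 13 digits also appeared earlier in the record, so it is worth checking against real data before relying on it.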