user1628658

Reputation: 587

How to make sed ignore slash and backslash

Okay, first of all: I did try searching for answers before posting this. I'm not saying that there are none, just that I was unable to find one. In my defense, I've been forced to switch from scripting and other interesting work to working as a grocery shop assistant, so my brain has probably rotted away.

What I'm trying to do is the following:

I have a file that contains, let's say, descriptions of goods, including EAN codes. There are no proper delimiters; I only have column lengths. I know that the EAN code column starts at position 134 and ends at position 147.

I tried using this:

cat $processedFile | sed -E "s/^(.{134})/\1;/g" | sed -E "s/^(.{148})/\1;/g" >> $outFile

My problem is this:

The people working with the software that generates the files are extremely mistrustful of computers and don't really care which characters they use when naming goods. Therefore some items contain a slash or backslash as part of their name, which is another column in the file, incidentally right in front of the EAN.

Some columns therefore remain unprocessed.

Example of the input file:

00110363 201406170014469 35.0 1 35.000 0.2360 0.3720 T SnackName001 chees-onion8588004269750 0291410610363 0 0.00.000
00110363 201406170013935 24.0 1 24.000 0.2780 0.4320 T SnackName002 blah-blah-b78588000510535 0291410610363 0 0.00.000
00110363 201406170013936 24.0 1 24.000 0.2780 0.4320 T SnackName003 blah-blah-b78588000510511 0291410610363 0 0.00.000
00110363 201406170016056 18.0 1 18.000 0.2033 0.3520 T SnackName004 blah-blah 3838700069938 0291410610363 0 0.00.000
00110363 201406170013808 10.0 1 10.000 0.5794 0.9220 T SnackName005 blah-blah-b8588000467617 0291410610363 0 0.00.000
00110363 201406170009326 8.0 5 40.000 0.7500 1.2120 T Sugar powd. brandN\ED1kg 8594003782411 0291410610363 0 0.00.000

The last record, the one containing \ED, is an example of what causes me headaches.

Any hints? Or... would it be better to use something entirely different from sed?

I need to make sure that the scripts are idiot-proof, since I expect that people who have difficulty finding the power button on the chassis will be working with them later on.

EDIT: I apologize, I didn't realize that the EANs aren't so easily distinguishable in my example ^_^; Thank you, condorwasabi.

The EAN code is the integer that follows the item name. To be more precise: in 00110363 201406170014469 35.0 1 35.000 0.2360 0.3720 T SnackName001 chees-onion8588004269750 0291410610363 0 0.00.000 the 8588004269750 is the EAN part. And yes, in the file, if the name is too long, there is no space, colon, semicolon or any other character to mark the end of the name and the beginning of the EAN code.

Upvotes: 0

Views: 656

Answers (1)

mklement0

Reputation: 437109

I suggest using awk.

I'm not fully clear on the requirements, but this may get you started:

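awk '{
    cleanLine = $0                        # copy of the line; $0 itself stays unmodified
    gsub(/\\[A-Z]{2}/, "", cleanLine)     # strip sequences such as \ED from the copy
    EAN = substr(cleanLine, 134, 13)      # 13-character EAN, by position in the cleaned copy
    sub(EAN, ";" EAN ";")                 # wrap its first occurrence in the original line ($0)
    print
}' file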
  • Creates a temporary, cleaned-up copy of the input line in which sequences such as \ED (a \ followed by two uppercase letters) are removed. You also mention / in your question; I'm not sure what patterns to look for there, but whatever regex covers them must replace /\\[A-Z]{2}/ above.
    NOTE:
    • I assume that all such sequences are extraneous and that cleaning up simply means removing them. If, on the other hand, each one stands for a specific number of original characters, replace the "" argument to gsub() with a string of that many dummy characters, e.g., "x".
  • Extracts the EAN from the cleaned-up line by character positions.
  • Replaces the EAN in the original line with the EAN enclosed in ";" and prints the result.
    • Note that this assumes that the EAN doesn't also appear before column 134 in the input file.
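
If the caveat in the last bullet is a concern, one possible variation is to insert the semicolons by position instead of by pattern. The following is only a sketch: mark_ean is a hypothetical helper name, and it assumes that every removed sequence sits before the EAN column (as in your sample, where the name column precedes the EAN), so that the number of removed characters tells us how far the EAN is shifted to the right in the original line. It also spells the pattern as [A-Z][A-Z] rather than {2}, since some older awk implementations don't support interval expressions.

```shell
# Hypothetical helper: mark_ean START LEN < input
# START and LEN describe the EAN column as it would appear in a line
# WITHOUT any backslash sequences (e.g. 134 and 13, as above).
mark_ean() {
    awk -v start="$1" -v len="$2" '{
        clean = $0
        # Remove every backslash followed by two uppercase letters from the copy.
        gsub(/\\[A-Z][A-Z]/, "", clean)
        # Assumption: all removed sequences lie before the EAN column, so
        # the EAN is shifted right by exactly this many characters in $0.
        removed = length($0) - length(clean)
        pos = start + removed
        # Rebuild the original line with ";" on both sides of the EAN.
        print substr($0, 1, pos - 1) ";" substr($0, pos, len) ";" substr($0, pos + len)
    }'
}
```

It could then be called as mark_ean 134 13 < "$processedFile" >> "$outFile". Because nothing is matched by content, an EAN whose digits also appear earlier in the line is not a problem with this approach.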

Upvotes: 1
