Sandra

Reputation: 23

Non-delimited file to delimited file

My file is:

07/15/19 02:58:40 PM   toto2_UBD_FD_GPRS_GLBL_EXCP   TABLE         DATA    COMPLETE   BASE_TEST_DSA   0            0
07/15/19 02:58:40 PM   toto_UBD_FD_GPRS_GLBL_EXCP    TABLE         DATA    COMPLETE   BASE_TEST_DSA   37484032     253118
07/15/19 02:55:59 PM   UBD_FD_GPRS_GLBL_EXCP         TABLE         DATA    COMPLETE   BASE_TEST_DSA   37484032     253118
07/15/19 02:55:59 PM   UBV_FD_GSM_GLBL_EXCP          TABLE         DATA    COMPLETE   BASE_TEST_DSA   0            0

I expected:

07/15/19 02:58:40 PM;toto2_UBD_FD_GPRS_GLBL_EXCP;TABLE;DATA;COMPLETE;BASE_TEST_DSA;0;0;;
07/15/19 02:58:40 PM;toto_UBD_FD_GPRS_GLBL_EXCP;TABLE;DATA;COMPLETE;BASE_TEST_DSA;37484032;253118;;
07/15/19 02:55:59 PM;UBD_FD_GPRS_GLBL_EXCP;TABLE;DATA;COMPLETE;BASE_TEST_DSA;37484032;253118;;
07/15/19 02:55:59 PM;UBV_FD_GSM_GLBL_EXCP;TABLE;DATA;COMPLETE;BASE_TEST_DSA;0;0;;

I tried sed -e 's/ /;/g' or awk -F' ' -v OFS=';' '{$1=$1}1' file

The result is:

07/15/19 02:58:40 PM;toto2_UBD_FD_GPRS_GLBL_EXCP;TABLE;;;DATA; COMPLETE;BASE_TEST_DSA;0;;;;0;;;;;;;;;;;;;
07/15/19 02:58:40 PM;toto_UBD_FD_GPRS_GLBL_EXCP; TABLE;;;DATA; COMPLETE;BASE_TEST_DSA;37484032;  253118;;;;;;;;;;;
07/15/19 02:55:59 PM;UBD_FD_GPRS_GLBL_EXCP;;;TABLE;;;DATA; COMPLETE;BASE_TEST_DSA;37484032;  253118;;;;;;;;;;;
07/15/19 02:55:59 PM;UBV_FD_GSM_GLBL_EXCP;;; TABLE;;;DATA; COMPLETE;BASE_TEST_DSA;0;;;;0;;;;;;;;;;;;;

I need to put this data into a table with 10 delimited fields, so I need 9 semicolons on each line of my file; otherwise I can't load it. The difficulty for me here is the last two columns: they hold an error code and an error text, and they can be null. They are NULL in my example file. The other difficulty is the whitespace: there is no real delimiter between the data in the source file.

Upvotes: 0

Views: 88

Answers (2)

Jonathan Leffler

Reputation: 753695

For the original data

Judging from the data shown, the input file has three blanks between columns. Therefore, a sed script is appropriate:

sed -e 's/   /;/g'

Replace sequences of three blanks with a semicolon, all the way along the line.

The fact that line 3 has more columns than the other lines is likely to be a headache when loading the data into a database (if that's what you mean by 'import').

The code makes no attempt to handle pre-existing semicolons in the data.

Original data:

07/15/19 02:58:40 PM   toto2   TABLE   DATA   COMPLETE   BASE_TEST   0   0      
07/15/19 02:58:40 PM   toto   TABLE   DATA   COMPLETE   BASE_TEST   37484032   253118      
07/15/19 02:55:59 PM   UBD   TABLE   DATA   COMPLETE   BASE_TEST   32   2   CORERROR   LIBLERROR
07/15/19 02:55:59 PM   UBV   TABLE   DATA   COMPLETE   BASE_TEST   0   0      

When I copied the data from the question, I found that lines 1, 2, 4 each had 6 trailing blanks. The sed script produced:

07/15/19 02:58:40 PM;toto2;TABLE;DATA;COMPLETE;BASE_TEST;0;0;;
07/15/19 02:58:40 PM;toto;TABLE;DATA;COMPLETE;BASE_TEST;37484032;253118;;
07/15/19 02:55:59 PM;UBD;TABLE;DATA;COMPLETE;BASE_TEST;32;2;CORERROR;LIBLERROR
07/15/19 02:55:59 PM;UBV;TABLE;DATA;COMPLETE;BASE_TEST;0;0;;

Note that there are two trailing semicolons on those lines, for the empty fields that correspond to CORERROR and LIBLERROR in line 3. So it appears that the data does have a uniform number of fields after all. Trailing spaces are remarkably hard to spot!
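A couple of standard commands make trailing blanks visible (an aside, not part of the original answer; both are ordinary grep and cat invocations):

$ grep -n ' $' data     # list lines that end with a blank
$ cat -A data           # GNU cat: show $ at the end of every line (cat -e on BSD)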


For the revised data

The original data had a uniform 3 spaces between fields. The revised data has a non-uniform number of spaces between fields, from 3 upwards. That can be handled in sed using:

sed -e 's/ \{3,\}/;/g'

The only problem now is that adjacent empty fields can't be distinguished.
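To illustrate the point (a constructed example, not from the original answer), suppose an interior column is empty; the empty column just widens the gap, and the whole gap collapses into a single semicolon:

$ printf 'A   B      D\n' | sed 's/ \{3,\}/;/g'
A;B;D

If the six blanks between B and D were meant to be B, an empty column, then D, the intended A;B;;D cannot be recovered from the spacing alone.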

On the data:

07/15/19 02:58:40 PM   toto2_UBD_FD_GPRS_GLBL_EXCP   TABLE         DATA    COMPLETE   BASE_TEST_DSA   0            0
07/15/19 02:58:40 PM   toto_UBD_FD_GPRS_GLBL_EXCP    TABLE         DATA    COMPLETE   BASE_TEST_DSA   37484032     253118
07/15/19 02:55:59 PM   UBD_FD_GPRS_GLBL_EXCP         TABLE         DATA    COMPLETE   BASE_TEST_DSA   37484032     253118
07/15/19 02:55:59 PM   UBV_FD_GSM_GLBL_EXCP          TABLE         DATA    COMPLETE   BASE_TEST_DSA   0            0

The output is:

07/15/19 02:58:40 PM;toto2_UBD_FD_GPRS_GLBL_EXCP;TABLE;DATA;COMPLETE;BASE_TEST_DSA;0;0
07/15/19 02:58:40 PM;toto_UBD_FD_GPRS_GLBL_EXCP;TABLE;DATA;COMPLETE;BASE_TEST_DSA;37484032;253118
07/15/19 02:55:59 PM;UBD_FD_GPRS_GLBL_EXCP;TABLE;DATA;COMPLETE;BASE_TEST_DSA;37484032;253118
07/15/19 02:55:59 PM;UBV_FD_GSM_GLBL_EXCP;TABLE;DATA;COMPLETE;BASE_TEST_DSA;0;0

Needing 9 semicolons

I need 9 semicolons because the last 2 fields of each line may contain an error code and an error text. I do not have an example, but I need to insert NULL when I do not have an error code. So, is it possible to detect the number of semicolons per line: if there are 9 semicolons the line is OK; if not, add 2 semicolons at the end of the line? My first 7 fields are mandatory; the last 2 are not. I cannot load the file if its lines have different numbers of separators.

Then we need a fiddly second regex to provide the missing semicolons. I'll also cater to the possibility of there being only the error code and not any error text. This also handles data both with and without trailing spaces after the seven compulsory fields.

sed -e 's/ \{3,\}/;/g' \
    -e 's/^\([^;]*;\)\{1,8\}[^;][^;]*$/&;/' \
    -e 's/^\([^;]*;\)\{8\}$/&;/'

The second line is close to inscrutable, but \([^;]*;\)\{1,8\}[^;][^;]*$ can be split into a count (\{1,8\}) of 1 to 8 sequences of 'zero or more non-semicolons' ([^;]*), the body of a field, followed by a semicolon, all wrapped in grouping parentheses \(…\), and followed by one or more occurrences of something other than a semicolon and then the end of the line. The replacement adds a semicolon. This deals with lines without trailing spaces at the end, or those with one or two trailing spaces (three or more trailing spaces are converted to a semicolon by the first s///).

The last part looks for lines with exactly 8 semicolon-terminated fields and nothing else, and adds an extra semicolon.
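For illustration (a reconstruction using the sample data above, not part of the original answer), a line with no trailing blanks picks up its semicolons in stages: the first substitution leaves 7, the second appends one for the missing error code, and the third appends one for the missing error text:

$ printf '07/15/19 02:55:59 PM   UBV   TABLE   DATA   COMPLETE   BASE_TEST   0   0\n' |
> sed -e 's/ \{3,\}/;/g' \
>     -e 's/^\([^;]*;\)\{1,8\}[^;][^;]*$/&;/' \
>     -e 's/^\([^;]*;\)\{8\}$/&;/'
07/15/19 02:55:59 PM;UBV;TABLE;DATA;COMPLETE;BASE_TEST;0;0;;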

This is why I prefer to use -e options — it allows me to separate the complex regexes to make the code more readable (to the extent that regexes can be made readable). You could combine all three -e options into a single string argument separated by semicolons:

sed 's/ \{3,\}/;/g; s/^\([^;]*;\)\{1,8\}[^;][^;]*$/&;/; s/^\([^;]*;\)\{8\}$/&;/'

but that is an intimidating command line to read. You could omit the spaces; that makes it still harder to read. I definitely prefer the clarity and readability of three separate -e options, not least because it is much, much, much easier to see the symmetry between the last two patterns when they're vertically aligned.

Fortunately, I test before posting. For several iterations of the commentary and code above, the output data wasn't working as expected, in part because the data I had didn't have trailing spaces at the end.

Given this modified data (lines 2-4 have trailing blanks; line 1 doesn't; line 5 has an error code and error text; line 6 has an error code but no error text):

07/15/19 02:58:40 PM   toto2_UBD_FD_GPRS_GLBL_EXCP   TABLE         DATA    COMPLETE   BASE_TEST_DSA   0            0
07/15/19 02:58:40 PM   toto_UBD_FD_GPRS_GLBL_EXCP    TABLE         DATA    COMPLETE   BASE_TEST_DSA   37484032     253118      
07/15/19 02:55:59 PM   UBD_FD_GPRS_GLBL_EXCP         TABLE         DATA    COMPLETE   BASE_TEST_DSA   37484032     253118      
07/15/19 02:55:59 PM   UBV_FD_GSM_GLBL_EXCP          TABLE         DATA    COMPLETE   BASE_TEST_DSA   0            0      
09/04/19 07:36:21 AM   UBV_FD_GSM_GLBL_EXCP          TABLE         DATA    COMPLETE   BASE_TEST_DSA   0            252291   CORERROR   LIBERROR
09/04/19 07:36:57 AM   GRANLIBAKKEN                  CHAIR         DATA    COMPLETE   RUNCIBLE_SPOON   0            252291   CORERROR   

For clarity, I have a program that adds a visible \n after trailing white space, which makes it obvious where each line really ends. It presents the data as:

07/15/19 02:58:40 PM   toto2_UBD_FD_GPRS_GLBL_EXCP   TABLE         DATA    COMPLETE   BASE_TEST_DSA   0            0
07/15/19 02:58:40 PM   toto_UBD_FD_GPRS_GLBL_EXCP    TABLE         DATA    COMPLETE   BASE_TEST_DSA   37484032     253118      \n
07/15/19 02:55:59 PM   UBD_FD_GPRS_GLBL_EXCP         TABLE         DATA    COMPLETE   BASE_TEST_DSA   37484032     253118      \n
07/15/19 02:55:59 PM   UBV_FD_GSM_GLBL_EXCP          TABLE         DATA    COMPLETE   BASE_TEST_DSA   0            0      \n
09/04/19 07:36:21 AM   UBV_FD_GSM_GLBL_EXCP          TABLE         DATA    COMPLETE   BASE_TEST_DSA   0            252291   CORERROR   LIBERROR
09/04/19 07:36:57 AM   GRANLIBAKKEN                  CHAIR         DATA    COMPLETE   RUNCIBLE_SPOON   0            252291   CORERROR   \n

The output from the sed script is:

07/15/19 02:58:40 PM;toto2_UBD_FD_GPRS_GLBL_EXCP;TABLE;DATA;COMPLETE;BASE_TEST_DSA;0;0;;
07/15/19 02:58:40 PM;toto_UBD_FD_GPRS_GLBL_EXCP;TABLE;DATA;COMPLETE;BASE_TEST_DSA;37484032;253118;;
07/15/19 02:55:59 PM;UBD_FD_GPRS_GLBL_EXCP;TABLE;DATA;COMPLETE;BASE_TEST_DSA;37484032;253118;;
07/15/19 02:55:59 PM;UBV_FD_GSM_GLBL_EXCP;TABLE;DATA;COMPLETE;BASE_TEST_DSA;0;0;;
09/04/19 07:36:21 AM;UBV_FD_GSM_GLBL_EXCP;TABLE;DATA;COMPLETE;BASE_TEST_DSA;0;252291;CORERROR;LIBERROR
09/04/19 07:36:57 AM;GRANLIBAKKEN;CHAIR;DATA;COMPLETE;RUNCIBLE_SPOON;0;252291;CORERROR;

I had the sed command in a simple file called chkit.sh, and one of the tests was:

$ sh chkit.sh < data | tr -dc ';\n'
;;;;;;;;;
;;;;;;;;;
;;;;;;;;;
;;;;;;;;;
;;;;;;;;;
;;;;;;;;;
$

The tr command deletes all characters except semicolon and newline, which allows me to see that the lines all have the same number of semicolons in them. I also have a script, length, that prints the length of the line, a colon, a space and the content of the line:

$ sh chkit.sh < data | tr -dc ';\n' | length
9: ;;;;;;;;;
9: ;;;;;;;;;
9: ;;;;;;;;;
9: ;;;;;;;;;
9: ;;;;;;;;;
9: ;;;;;;;;;
$

So, every line of the output contains 9 semicolons. This can be quite a useful way of validating delimited data, as long as the delimiter does not appear within any of the data fields.
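The length script itself isn't shown; a minimal stand-in (a sketch, not the author's actual script) is a one-line awk wrapper:

$ cat length
#!/bin/sh
# print the length of each input line, a colon, a space, and the line itself
awk '{ printf "%d: %s\n", length($0), $0 }' "$@"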

If I were working towards production code, the script would have a better name than chkit.sh and would have "$@" to take command line arguments (file names, or, indeed, extra script — I could add more -e '…' options to the command line; yet another reason to prefer using -e).
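A sketch of such a script (hypothetical name and layout, reusing the sed expressions developed above) might be:

#!/bin/sh
# to-semicolons.sh (hypothetical name): convert blank-aligned columns to
# semicolon-delimited fields, padding every line so it ends up with 9 semicolons.
sed -e 's/ \{3,\}/;/g' \
    -e 's/^\([^;]*;\)\{1,8\}[^;][^;]*$/&;/' \
    -e 's/^\([^;]*;\)\{8\}$/&;/' \
    "$@"

With "$@" in place, the script reads standard input when no file names are given and otherwise processes the files named on the command line.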

This also goes to show how crucial good (meaning comprehensive, covering all special cases as well as the routine) sample data is. With different sample data, you might come up with a different solution.

Upvotes: 3

Ed Morton

Reputation: 203493

$ awk -F'   ' -v OFS=';' '{$1=$1}1' file
07/15/19 02:58:40 PM;toto2;TABLE;DATA;COMPLETE;BASE_TEST;0;0;;
07/15/19 02:58:40 PM;toto;TABLE;DATA;COMPLETE;BASE_TEST;37484032;253118;;
07/15/19 02:55:59 PM;UBD;TABLE;DATA;COMPLETE;BASE_TEST;32;2;CORERROR;LIBLERROR
07/15/19 02:55:59 PM;UBV;TABLE;DATA;COMPLETE;BASE_TEST;0;0;;
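That handles the original sample, where fields are separated by exactly three blanks and the trailing blanks supply the two empty fields. For the revised data, where the gaps vary in width and trailing blanks may be absent, a possible variant (a sketch, not part of this answer) treats runs of three or more blanks as the separator and pads short records to 10 fields:

$ awk -F'   +' -v OFS=';' '{ $1=$1; if (NF < 10) $10 = "" } 1' file
07/15/19 02:58:40 PM;toto2_UBD_FD_GPRS_GLBL_EXCP;TABLE;DATA;COMPLETE;BASE_TEST_DSA;0;0;;
07/15/19 02:58:40 PM;toto_UBD_FD_GPRS_GLBL_EXCP;TABLE;DATA;COMPLETE;BASE_TEST_DSA;37484032;253118;;
07/15/19 02:55:59 PM;UBD_FD_GPRS_GLBL_EXCP;TABLE;DATA;COMPLETE;BASE_TEST_DSA;37484032;253118;;
07/15/19 02:55:59 PM;UBV_FD_GSM_GLBL_EXCP;TABLE;DATA;COMPLETE;BASE_TEST_DSA;0;0;;

Assigning $10 when NF is below 10 extends the record to 10 fields, so every output line has 9 semicolons.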

Upvotes: 3
