Reputation: 23
My file is :
07/15/19 02:58:40 PM toto2_UBD_FD_GPRS_GLBL_EXCP TABLE DATA COMPLETE BASE_TEST_DSA 0 0
07/15/19 02:58:40 PM toto_UBD_FD_GPRS_GLBL_EXCP TABLE DATA COMPLETE BASE_TEST_DSA 37484032 253118
07/15/19 02:55:59 PM UBD_FD_GPRS_GLBL_EXCP TABLE DATA COMPLETE BASE_TEST_DSA 37484032 253118
07/15/19 02:55:59 PM UBV_FD_GSM_GLBL_EXCP TABLE DATA COMPLETE BASE_TEST_DSA 0 0
I expected
07/15/19 02:58:40 PM;toto2_UBD_FD_GPRS_GLBL_EXCP;TABLE;DATA;COMPLETE;BASE_TEST_DSA;0;0;;
07/15/19 02:58:40 PM;toto_UBD_FD_GPRS_GLBL_EXCP;TABLE;DATA;COMPLETE;BASE_TEST_DSA;37484032;253118;;
07/15/19 02:55:59 PM;UBD_FD_GPRS_GLBL_EXCP;TABLE;DATA;COMPLETE;BASE_TEST_DSA;37484032;253118;;
07/15/19 02:55:59 PM;UBV_FD_GSM_GLBL_EXCP;TABLE;DATA;COMPLETE;BASE_TEST_DSA;0;0;;
I tried
sed -e 's/ /;/g'
or
awk -F' ' -v OFS=';' '{$1=$1}1' file
The result is :
07/15/19 02:58:40 PM;toto2_UBD_FD_GPRS_GLBL_EXCP;TABLE;;;DATA; COMPLETE;BASE_TEST_DSA;0;;;;0;;;;;;;;;;;;;
07/15/19 02:58:40 PM;toto_UBD_FD_GPRS_GLBL_EXCP; TABLE;;;DATA; COMPLETE;BASE_TEST_DSA;37484032; 253118;;;;;;;;;;;
07/15/19 02:55:59 PM;UBD_FD_GPRS_GLBL_EXCP;;;TABLE;;;DATA; COMPLETE;BASE_TEST_DSA;37484032; 253118;;;;;;;;;;;
07/15/19 02:55:59 PM;UBV_FD_GSM_GLBL_EXCP;;; TABLE;;;DATA; COMPLETE;BASE_TEST_DSA;0;;;;0;;;;;;;;;;;;;
I need to put this data into a table, with 10 delimited fields. So I need 9 semicolons on my file, otherwise i cant load my file. The difficulty for me here, is the last two colons, It's error code and error text, it can be null. It's NULL in my exemple file. The other difficulty is the whitespace, there is not delimiter between the data in source file.
Upvotes: 0
Views: 88
Reputation: 753695
Judging from the data shown, the input file has three blanks between columns. Therefore, a sed
script is appropriate:
sed -e 's/ /;/g'
Replace sequences of three blanks with a semicolon, all the way along the line.
The fact that line 3 has more columns than the other lines is likely be a headache when loading the data into a database — if that's what you mean by 'import'.
The code makes no attempt to handle pre-existing semicolons in the data.
Original data:
07/15/19 02:58:40 PM toto2 TABLE DATA COMPLETE BASE_TEST 0 0
07/15/19 02:58:40 PM toto TABLE DATA COMPLETE BASE_TEST 37484032 253118
07/15/19 02:55:59 PM UBD TABLE DATA COMPLETE BASE_TEST 32 2 CORERROR LIBLERROR
07/15/19 02:55:59 PM UBV TABLE DATA COMPLETE BASE_TEST 0 0
When I copied the data from the question, I found that lines 1, 2, 4 each had 6 trailing blanks. The sed
script produced:
07/15/19 02:58:40 PM;toto2;TABLE;DATA;COMPLETE;BASE_TEST;0;0;;
07/15/19 02:58:40 PM;toto;TABLE;DATA;COMPLETE;BASE_TEST;37484032;253118;;
07/15/19 02:55:59 PM;UBD;TABLE;DATA;COMPLETE;BASE_TEST;32;2;CORERROR;LIBLERROR
07/15/19 02:55:59 PM;UBV;TABLE;DATA;COMPLETE;BASE_TEST;0;0;;
Note that there are two trailing semicolons on those lines, for the empty fields containing CORERROR
and LIBLERROR
in line 3. So, it appears that the data does have a uniform number of fields after all. Trailing spaces are remarkably hard to spot!
The original data had a uniform 3-spaces between fields. The revised data has a non-uniform number of spaces between fields — from 3 upwards. That can be handled in sed
using:
sed -e 's/ \{3,\}/;/g'
The only problem is now that adjacent empty fields can't be distinguished.
On the data:
07/15/19 02:58:40 PM toto2_UBD_FD_GPRS_GLBL_EXCP TABLE DATA COMPLETE BASE_TEST_DSA 0 0
07/15/19 02:58:40 PM toto_UBD_FD_GPRS_GLBL_EXCP TABLE DATA COMPLETE BASE_TEST_DSA 37484032 253118
07/15/19 02:55:59 PM UBD_FD_GPRS_GLBL_EXCP TABLE DATA COMPLETE BASE_TEST_DSA 37484032 253118
07/15/19 02:55:59 PM UBV_FD_GSM_GLBL_EXCP TABLE DATA COMPLETE BASE_TEST_DSA 0 0
The output is:
07/15/19 02:58:40 PM;toto2_UBD_FD_GPRS_GLBL_EXCP;TABLE;DATA;COMPLETE;BASE_TEST_DSA;0;0
07/15/19 02:58:40 PM;toto_UBD_FD_GPRS_GLBL_EXCP;TABLE;DATA;COMPLETE;BASE_TEST_DSA;37484032;253118
07/15/19 02:55:59 PM;UBD_FD_GPRS_GLBL_EXCP;TABLE;DATA;COMPLETE;BASE_TEST_DSA;37484032;253118
07/15/19 02:55:59 PM;UBV_FD_GSM_GLBL_EXCP;TABLE;DATA;COMPLETE;BASE_TEST_DSA;0;0
I need 9 semicolons because it is possible that my last 2 data, per line, could be an error code and an error text. I do not have an example, but I need to insert null when I do not have an error code. So, is it possible to detect the number of semicolons per line, if 9 semicolons then the line is OK, if not add 2 semicolons at the end of the line. My first 7 data are mandatory, the last 2 are not. I can not load the file with a different number of separators in the lines of my file.
Then we need a fiddly second regex to provide the missing semicolons. I'll also cater to the possibility of there being only the error code and not any error text. This also handles data both with and without trailing spaces after the seven compulsory fields.
sed -e 's/ \{3,\}/;/g' \
-e 's/^\([^;]*;\)\{1,8\}[^;][^;]*$/&;/' \
-e 's/^\([^;]*;\)\{8\}$/&;/'
The second line is close to inscrutable, but \([^;]*;\)\{1,8\}[%;][%;]*$
can be split into a count (\{1,8\}
) of 1 to 8 sequences of 'zero or more non-semicolons' ([^;]*
), the body of a field, followed by a semicolon, all wrapped in grouping parentheses \(…\)
, and followed by one or more occurrences of something other than a semicolon and the end of line. The replacement adds a semicolon. This deals with lines without trailing spaces at the end, or those with one or two trailing spaces (three or more trailing spaces are converted to a semicolon by the first s///
).
The last part looks for lines with 8 semi-colon terminated fields only, and adds an extra semicolon.
This is why I prefer to use -e
options — it allows me to separate the complex regexes to make the code more readable (to the extent that regexes can be made readable). You could combine all three -e
options into a single string argument separated by semicolons:
sed 's/ \{3,\}/;/g; s/^\([^;]*;\)\{1,8\}[^;][%;]*$/&;/; s/^\([^;]*;\)\{8\}$/&;/'
but that is an intimidating command line to read. You could omit the spaces; that makes it still harder to read. I definitely prefer the clarity and readability of three separate -e
options, not least because it is much, much, much easier to see the symmetry between the last two patterns when they're vertically aligned.
Fortunately, I test before posting. For several iterations of the commentary and code above, the output data wasn't working as expected, in part because the data I had didn't have trailing spaces at the end.
Given this modified data (lines 2-4 have trailing blanks; line 1 doesn't; line 5 has an error code and error text; line 6 has an error code but no error text):
07/15/19 02:58:40 PM toto2_UBD_FD_GPRS_GLBL_EXCP TABLE DATA COMPLETE BASE_TEST_DSA 0 0
07/15/19 02:58:40 PM toto_UBD_FD_GPRS_GLBL_EXCP TABLE DATA COMPLETE BASE_TEST_DSA 37484032 253118
07/15/19 02:55:59 PM UBD_FD_GPRS_GLBL_EXCP TABLE DATA COMPLETE BASE_TEST_DSA 37484032 253118
07/15/19 02:55:59 PM UBV_FD_GSM_GLBL_EXCP TABLE DATA COMPLETE BASE_TEST_DSA 0 0
09/04/19 07:36:21 AM UBV_FD_GSM_GLBL_EXCP TABLE DATA COMPLETE BASE_TEST_DSA 0 252291 CORERROR LIBERROR
09/04/19 07:36:57 AM GRANLIBAKKEN CHAIR DATA COMPLETE RUNCIBLE_SPOON 0 252291 CORERROR
For clarity, I have a program that adds \n
after trailing white space. It presents the data as:
07/15/19 02:58:40 PM toto2_UBD_FD_GPRS_GLBL_EXCP TABLE DATA COMPLETE BASE_TEST_DSA 0 0
07/15/19 02:58:40 PM toto_UBD_FD_GPRS_GLBL_EXCP TABLE DATA COMPLETE BASE_TEST_DSA 37484032 253118 \n
07/15/19 02:55:59 PM UBD_FD_GPRS_GLBL_EXCP TABLE DATA COMPLETE BASE_TEST_DSA 37484032 253118 \n
07/15/19 02:55:59 PM UBV_FD_GSM_GLBL_EXCP TABLE DATA COMPLETE BASE_TEST_DSA 0 0 \n
09/04/19 07:36:21 AM UBV_FD_GSM_GLBL_EXCP TABLE DATA COMPLETE BASE_TEST_DSA 0 252291 CORERROR LIBERROR
09/04/19 07:36:57 AM GRANLIBAKKEN CHAIR DATA COMPLETE RUNCIBLE_SPOON 0 252291 CORERROR \n
The output from the sed
script is:
07/15/19 02:58:40 PM;toto2_UBD_FD_GPRS_GLBL_EXCP;TABLE;DATA;COMPLETE;BASE_TEST_DSA;0;0;;
07/15/19 02:58:40 PM;toto_UBD_FD_GPRS_GLBL_EXCP;TABLE;DATA;COMPLETE;BASE_TEST_DSA;37484032;253118;;
07/15/19 02:55:59 PM;UBD_FD_GPRS_GLBL_EXCP;TABLE;DATA;COMPLETE;BASE_TEST_DSA;37484032;253118;;
07/15/19 02:55:59 PM;UBV_FD_GSM_GLBL_EXCP;TABLE;DATA;COMPLETE;BASE_TEST_DSA;0;0;;
09/04/19 07:36:21 AM;UBV_FD_GSM_GLBL_EXCP;TABLE;DATA;COMPLETE;BASE_TEST_DSA;0;252291;CORERROR;LIBERROR
09/04/19 07:36:57 AM;GRANLIBAKKEN;CHAIR;DATA;COMPLETE;RUNCIBLE_SPOON;0;252291;CORERROR;
I had the sed
command in a simple file called chkit.sh
, and one of the tests was:
$ sh chkit.sh < data | tr -dc ';\n'
;;;;;;;;;
;;;;;;;;;
;;;;;;;;;
;;;;;;;;;
;;;;;;;;;
;;;;;;;;;
$
The tr
command deletes all characters except semicolon and newline, which allows me to see that the lines all have the same number of semicolons in them. I also have a script length
that prints the length of the line, a colon, a space and the content of the line:
$ sh chkit.sh < data | tr -dc ';\n' | length
9: ;;;;;;;;;
9: ;;;;;;;;;
9: ;;;;;;;;;
9: ;;;;;;;;;
9: ;;;;;;;;;
9: ;;;;;;;;;
$
So, the output always contains 9 semicolons. This can be quite a useful way of validating delimited data as long as the delimiter does not appear within any of the data fields.
If I were working towards production code, the script would have a better name than chkit.sh
and would have "$@"
to take command line arguments (file names, or, indeed, extra script — I could add more -e '…'
options to the command line; yet another reason to prefer using -e
).
This also goes to show how crucial good (meaning comprehensive, covering all special cases as well as the routine) sample data is. With different sample data, you might come up with a different solution.
Upvotes: 3
Reputation: 203493
$ awk -F' ' -v OFS=';' '{$1=$1}1' file
07/15/19 02:58:40 PM;toto2;TABLE;DATA;COMPLETE;BASE_TEST;0;0;;
07/15/19 02:58:40 PM;toto;TABLE;DATA;COMPLETE;BASE_TEST;37484032;253118;;
07/15/19 02:55:59 PM;UBD;TABLE;DATA;COMPLETE;BASE_TEST;32;2;CORERROR;LIBLERROR
07/15/19 02:55:59 PM;UBV;TABLE;DATA;COMPLETE;BASE_TEST;0;0;;
Upvotes: 3