user1488085

Reputation: 31

Convert arbitrary date-like string to date in Teradata

I have a column with such string values

12/01/1999
13021999
140301
bla bla 140302
just bla bla

Each value contains either a date in a relatively well-defined format, arbitrary text with a date embedded in it, or arbitrary text alone.

I need to place this data in a column of DATE datatype. How do I do that in Teradata?

In Oracle I would write a procedure that does some regexp/replace work and then converts the result to a date, returning NULL on exception. Then I would use this procedure in an INSERT/SELECT and my date column would be populated.
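Something like this is what I have in mind (the function name, regexp, and format mask here are only illustrative):

```sql
-- Illustrative Oracle sketch only; names and format mask are made up
CREATE OR REPLACE FUNCTION try_to_date(p_str IN VARCHAR2) RETURN DATE IS
BEGIN
    -- pull out the date-like part, then attempt the conversion
    RETURN TO_DATE(REGEXP_SUBSTR(p_str, '[0-9/]+'), 'mm/dd/yyyy');
EXCEPTION
    WHEN OTHERS THEN
        RETURN NULL; -- any conversion error becomes NULL
END;
/
```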

In Teradata 14.10 you can write a SQL user-defined function. But it only allows one statement, a RETURN expression. There is no way to do exception handling or use other statements like in a procedure.

Or you can write an external function in C/Java. But then you have to INSTALL it, and I'm not sure I have such rights in my environment. Additionally, I don't know C/Java.

Any other options for this task?

Upvotes: 1

Views: 1431

Answers (1)

JNevill

Reputation: 50218

There are plenty of ways to solve this one in Teradata.

You could get fancy in a stored procedure, moving a cursor through a recordset and doing the logic there. It would be slow, but the logic would be clean.
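A rough, untested sketch of that cursor approach (the table name, a target `cleandate` column, and the exact handler syntax are all assumptions here) might look like:

```sql
-- Untested sketch: loop through rows, let a handler swallow bad casts.
-- Assumes uglydates has been given a DATE column named cleandate.
REPLACE PROCEDURE clean_dates()
BEGIN
    FOR rec AS cur CURSOR FOR
        SELECT id, uglydate FROM uglydates
    DO
        BEGIN
            DECLARE v_date DATE;
            -- if the CAST below fails, leave v_date NULL and keep going
            DECLARE CONTINUE HANDLER FOR SQLEXCEPTION
                SET v_date = NULL;

            SET v_date = CAST(rec.uglydate AS DATE FORMAT 'mm/dd/yyyy');

            UPDATE uglydates
            SET cleandate = v_date
            WHERE id = rec.id;
        END;
    END FOR;
END;
```

You'd still need per-format logic inside the loop, which is why I'd lean toward pure SQL instead.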

I would just do it in a SQL statement though, and get fancy with the CASE statement logic. I would split each record into words using strtok_split_to_table, then check each returned token with a regex to figure out how it's formatted, and attempt to cast it based on that. Something like:

CREATE MULTISET VOLATILE TABLE uglydates
(
    id INTEGER,
    uglydate VARCHAR(100)

)UNIQUE PRIMARY INDEX ("id") ON COMMIT PRESERVE ROWS;


INSERT INTO uglydates VALUES (1, '12/01/1999');
INSERT INTO uglydates VALUES (2, '13021999');
INSERT INTO uglydates VALUES (3, '140301');
INSERT INTO uglydates VALUES (4, 'bla bla 140302');
INSERT INTO uglydates VALUES (5, 'just bla bla');

SELECT
    outkey,
    token,
    CASE 
        WHEN REGEXP_SIMILAR(token, '^[0-9/-]*$') = 1 
            THEN /* We've got a date... maybe */            
                CASE 
                    WHEN "LENGTH"(token) = 10 AND "LENGTH"(OREPLACE(token, '/', '')) = "LENGTH"(token)-2 
                        THEN /* if the format is always the same in this scenario then */
                            CAST(token AS DATE FORMAT 'mm/dd/yyyy')
                    WHEN "LENGTH"(token) = 6 
                        THEN
                            CAST(token AS DATE FORMAT 'yymmdd')
                    END
            ELSE NULL
    END
FROM
    (
        SELECT d.token, d.outkey
        FROM TABLE (strtok_split_to_table(uglydates.id, uglydates.uglydate, ' ')
                RETURNS (outkey INTEGER, tokennum INTEGER, token VARCHAR(20) CHARACTER SET UNICODE) ) as d
    ) tokens

That's a pretty bare-bones starter, but it gets you headed in the right direction. I've used strtok_split_to_table to parse the ugly date text and return a record for each word, along with the record's id, so I can insert/update later on without losing the reference to the record. The rest is pretty straightforward regex work which should look familiar from whatever you were doing in Oracle.

The example above produces the following output:

+--------+------------+--------------------+
| outkey |   token    | <CASE  expression> |
+--------+------------+--------------------+
|      5 | just       |                    |
|      3 | 140301     | 2014-03-01         |
|      4 | bla        |                    |
|      1 | 12/01/1999 | 1999-12-01         |
|      2 | 13021999   |                    |
|      5 | bla        |                    |
|      4 | bla        |                    |
|      5 | bla        |                    |
|      4 | 140302     | 2014-03-02         |
+--------+------------+--------------------+

To pare that down to the date you care about, you could use Teradata's QUALIFY clause, which allows you to use window functions to filter a result set. It's analogous to doing the window function in the SELECT clause, sticking the results in a subquery, and using a WHERE clause on the window function's result field, but without all the extra code. Something like:

SELECT
    outkey,
    token,
    CASE 
        WHEN REGEXP_SIMILAR(token, '^[0-9/-]*$') = 1 
            THEN /* We've got a date... maybe */            
                CASE 
                    WHEN "LENGTH"(token) = 10 AND "LENGTH"(OREPLACE(token, '/', '')) = "LENGTH"(token)-2 
                        THEN /* if the format is always the same in this scenario then */
                            CAST(token AS DATE FORMAT 'mm/dd/yyyy')
                    WHEN "LENGTH"(token) = 6 
                        THEN
                            CAST(token AS DATE FORMAT 'yymmdd')
                    WHEN "LENGTH"(token) = 8
                        THEN
                            CAST(token AS DATE FORMAT 'ddmmyyyy')
                    ELSE NULL
                    END
            ELSE NULL
    END AS outdate
FROM
    (
        SELECT d.token, d.outkey
        FROM TABLE (strtok_split_to_table(uglydates.id, uglydates.uglydate, ' ')
                RETURNS (outkey INTEGER, tokennum INTEGER, token VARCHAR(20) CHARACTER SET UNICODE) ) as d
    ) tokens
QUALIFY ROW_NUMBER() OVER (PARTITION BY outkey ORDER BY outdate DESC) = 1

That will partition the data by outkey, order it by the date we figured out, descending, then pick out the first record in that sort for each outkey. The result set would look like:

+--------+------------+------------+
| outkey |   token    |  outdate   |
+--------+------------+------------+
|      1 | 12/01/1999 | 1999-12-01 |
|      2 | 13021999   | 1999-02-13 |
|      3 | 140301     | 2014-03-01 |
|      4 | 140302     | 2014-03-02 |
|      5 | bla        |            |
+--------+------------+------------+

And with that, you are prepped for an UPDATE statement.
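As a sketch, the UPDATE could wrap that whole query as a derived table using Teradata's UPDATE ... FROM syntax (the `cleandate` DATE column on uglydates is an assumption; add whichever column you actually need):

```sql
-- Sketch only: assumes uglydates has been given a DATE column named cleandate
UPDATE uglydates
FROM (
    SELECT
        outkey,
        CASE 
            WHEN REGEXP_SIMILAR(token, '^[0-9/-]*$') = 1 THEN
                CASE 
                    WHEN "LENGTH"(token) = 10 THEN CAST(token AS DATE FORMAT 'mm/dd/yyyy')
                    WHEN "LENGTH"(token) = 8  THEN CAST(token AS DATE FORMAT 'ddmmyyyy')
                    WHEN "LENGTH"(token) = 6  THEN CAST(token AS DATE FORMAT 'yymmdd')
                END
        END AS outdate
    FROM TABLE (strtok_split_to_table(uglydates.id, uglydates.uglydate, ' ')
            RETURNS (outkey INTEGER, tokennum INTEGER, token VARCHAR(20) CHARACTER SET UNICODE)) AS d
    QUALIFY ROW_NUMBER() OVER (PARTITION BY outkey ORDER BY outdate DESC) = 1
) src
SET cleandate = src.outdate
WHERE uglydates.id = src.outkey;
```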

The cool thing about doing it all in a SQL statement is that you allow Teradata to optimize the execution path for you. Even on a large table, this will run pretty darned fast, whereas an SP with a cursor (albeit with easier-to-read logic) will be pretty slow. And a custom function will be limited to single-AMP processing (I believe that's still the case even in 15.10, right? I don't know 100% for sure), so even if you have a properly indexed table with low skew, you will be running the most expensive bits of logic on a small portion of your box. @dnoeth, keep me honest.

Upvotes: 2
