Reputation: 6450

Postgresql date formatting

I have a dataset (in .tsv format) where one of the columns is date of birth. However, the data is old, and not all dates are in YYYY-MM-DD format. Some entries just have their year of birth (month and day of birth are missing) and are in the format YYYY-##-## (literally ##'s are already inserted in the data wherever only year is known). I wish to load this dataset into my postgres database, with the date of birth column with data type as date and not string, so that I can do comparison on dates.
A small sample is shown below. (Irrelevant columns of data are not shown)

1924-##-##
1965-09-04
1944-11-05
1951-##-##
-388-##-##
1893-01-26
1037-##-##

Directly bulk loading the dataset obviously gives an error

ERROR:  invalid input syntax for type date: "1924-##-##"
LINE 1: insert into d values ('1924-##-##');
                              ^

The dataset is quite large, having around 6 crore entries. Currently I am thinking of running a script for replacing these ##'s with 01 and then inserting the modified data into the database. But I don't like this idea for -

This is time consuming.
This is disk-space consuming (as I would like to keep the original "umtampered" data)
Also, not all my data would be genuine in my database.

Is there any way I can ask postgres to somehow just take the dates as it is, by just ignoring the `##'s (and just keeping the year where the months and days are missing) ?
Or can there any better solution to this problem?

Upvotes: 0

Answers (3)

Gary - Stand with Ukraine

Reputation: 7541

You can create in your table two columns, one for the originally entered value (type of varchar), and one for calculation (type date).

CREATE TABLE your_table
(
 id INT,
 -- OTHER DETAILS
 dob_entered    VARCHAR,
 dob_parsed DATE
);

You can then use on an Insert trigger to automatically populate the date field from the varchar, with an update trigger to handle any changes.

CREATE OR REPLACE FUNCTION evaluate_dob_date() RETURNS TRIGGER AS
$$
BEGIN
    NEW.dob_parsed = CAST(REPLACE(NEW.dob_entered,'##','01') AS DATE);
    RETURN new;
END;
$$
LANGUAGE plpgsql;

CREATE TRIGGER parse_dob 
BEFORE INSERT OR UPDATE ON your_table
FOR EACH ROW
EXECUTE PROCEDURE evaluate_dob_date();

This will mean that you store both the originally entered data untouched for verification, while still having a date field in the database suitable for sorting and comparing etc. Also, by extending the evaluate_dob_date() function you can match against different cases as you find them, while still being able to reject records that are genuinely invalid.

Postgresql triggers

Upvotes: 1

user330315

Reputation:

If you need to keep the # characters, the only chance I see is to import that into a varchar column.

If you absolutely need the information as a date, you might get away with a view that does the conversion and only selects those rows where there is no # in the column.

Something like

SELECT to_date(dob,'YYYY-MM-DD') as dob_date
FROM your_table
WHERE substr(dob,6,2) <> '##';

If you do that on a regular basis, you might want to consider an index on that expression to speed up the selection:

CREATE INDEX dob_check ON your_table( substr(dob,6,2) );

Note that the expression in your select must match exactly the expression in the index in order to be usable by the query planer.

If you want to "convert" the data into a valid date during retrieval, you could do something like this:

SELECT case 
         case when substr(dob,6,2) = '##' then to_date(substr(dob,1,5)||'01-01', 'YYYY-MM-DD')
         else to_date(dob,'YYYY-MM-DD')
       end as dob_date
FROM your_table;

Upvotes: 0

dkulkarni

Reputation: 2830

You have two options here

Once you fetch the data from the database, replace all occurrences of # in your server side script and then compare the dates. (You could also replace the data dynamically in the table itself without modifying the existing data using an if condition in the query)
Ignore the dates that have a #. That way you can compare only valid dates.

Upvotes: 0

Postgresql date formatting

Answers (3)

Related Questions