Reputation: 6450
I have a dataset (in .tsv format) where one of the columns is date of birth. However, the data is old, and not all dates are in YYYY-MM-DD
format. Some entries just have their year of birth (month and day of birth are missing) and are in the format YYYY-##-##
(literally ##
's are already inserted in the data wherever only year is known). I wish to load this dataset into my postgres database, with the date of birth column with data type as date
and not string
, so that I can do comparison on dates.
A small sample is shown below. (Irrelevant columns of data are not shown)
1924-##-##
1965-09-04
1944-11-05
1951-##-##
-388-##-##
1893-01-26
1037-##-##
Directly bulk loading the dataset obviously gives an error
ERROR: invalid input syntax for type date: "1924-##-##"
LINE 1: insert into d values ('1924-##-##');
^
The dataset is quite large, having around 6 crore entries. Currently I am thinking of running a script for replacing these ##
's with 01
and then inserting the modified data into the database. But I don't like this idea for -
Is there any way I can ask postgres to somehow just take the dates as it is, by just ignoring the `##'s (and just keeping the year where the months and days are missing) ?
Or can there any better solution to this problem?
Upvotes: 0
Views: 3581
Reputation: 7541
You can create in your table two columns, one for the originally entered value (type of varchar), and one for calculation (type date).
CREATE TABLE your_table
(
id INT,
-- OTHER DETAILS
dob_entered VARCHAR,
dob_parsed DATE
);
You can then use on an Insert trigger to automatically populate the date field from the varchar, with an update trigger to handle any changes.
CREATE OR REPLACE FUNCTION evaluate_dob_date() RETURNS TRIGGER AS
$$
BEGIN
NEW.dob_parsed = CAST(REPLACE(NEW.dob_entered,'##','01') AS DATE);
RETURN new;
END;
$$
LANGUAGE plpgsql;
CREATE TRIGGER parse_dob
BEFORE INSERT OR UPDATE ON your_table
FOR EACH ROW
EXECUTE PROCEDURE evaluate_dob_date();
This will mean that you store both the originally entered data untouched for verification, while still having a date field in the database suitable for sorting and comparing etc. Also, by extending the evaluate_dob_date()
function you can match against different cases as you find them, while still being able to reject records that are genuinely invalid.
Upvotes: 1
Reputation:
If you need to keep the #
characters, the only chance I see is to import that into a varchar
column.
If you absolutely need the information as a date, you might get away with a view that does the conversion and only selects those rows where there is no #
in the column.
Something like
SELECT to_date(dob,'YYYY-MM-DD') as dob_date
FROM your_table
WHERE substr(dob,6,2) <> '##';
If you do that on a regular basis, you might want to consider an index on that expression to speed up the selection:
CREATE INDEX dob_check ON your_table( substr(dob,6,2) );
Note that the expression in your select must match exactly the expression in the index in order to be usable by the query planer.
If you want to "convert" the data into a valid date during retrieval, you could do something like this:
SELECT case
case when substr(dob,6,2) = '##' then to_date(substr(dob,1,5)||'01-01', 'YYYY-MM-DD')
else to_date(dob,'YYYY-MM-DD')
end as dob_date
FROM your_table;
Upvotes: 0
Reputation: 2830
You have two options here
Upvotes: 0