ETL file loading: files created today, or files not already loaded?

Question

I need to automate a process to load new data files into a database. My question is about the best way to determine which files are "new" in an automated fashion.

Files are retrieved from a directory that is synced nightly, so the list of files keeps growing. I don't have the option to wipe out files that I have already retrieved.

New records are stored in a raw data table that has a field indicating the filename where each record originated, so I could compare all filenames currently in the directory with filenames already in the raw data table, and process only those filenames that aren't in common.

Or I could use timestamps that are in the filenames, and process only those files that were created since the last time the import process was run.

I am leaning toward using the first approach since it seems less prone to error, but I haven't had much luck finding whether this is actually true. What are the pitfalls of determining new files in this manner, by comparing all filenames with the filenames already in the database?

Marlon Abeykoon · Accepted Answer

File name comparison:

If you have millions of files then comparison might not what you are looking for.
You must be sure that the files in the said folder never gets deleted.

Get filenames by date:

Since these filenames are retrieved once a day can guarantee the accuracy. (Even they created in millisecond difference)
Will be efficient if many files are there.
Pentaho gives the modified date not the created date.

To do either of the above, you can use the following Pentaho step.

Configuration Get File Names step:

File/Directory: Give the folder path contains the files.
Wildcard (RegExp): .*\.* to get all or .*\.pdf to get specific
format.

ETL file loading: files created today, or files not already loaded?

Answers (1)

Related Questions