Reputation: 1088
I need to automate a process to load new data files into a database. My question is about the best way to determine which files are "new" in an automated fashion.
Files are retrieved from a directory that is synced nightly, so the list of files keeps growing. I don't have the option to wipe out files that I have already retrieved.
New records are stored in a raw data table that has a field indicating the filename where each record originated, so I could compare all filenames currently in the directory with filenames already in the raw data table, and process only those filenames that aren't in common.
Or I could use timestamps that are in the filenames, and process only those files that were created since the last time the import process was run.
I am leaning toward using the first approach since it seems less prone to error, but I haven't had much luck finding whether this is actually true. What are the pitfalls of determining new files in this manner, by comparing all filenames with the filenames already in the database?
Upvotes: 1
Views: 1653
Reputation: 12465
File name comparison:
Get filenames by date:
To do either of the above, you can use the following Pentaho step.
Configuration Get File Names step:
File/Directory: Give the folder path contains the files.
Wildcard (RegExp): .*\.*
to get all or .*\.pdf
to get specific
format.
Upvotes: 1