Reputation: 2082
I'm integrating with an external system.
From it I get 3 files:
customer_data.csv
address_data.csv
additional_customer_data.csv
The order of records in each of them can be random.
There are two relations:
- one-to-many (customer_data => address_data), but I am interested in only one address of a specified kind
- one-to-one (customer_data => additional_customer_data)
Goal:
Merge the files together and put them into one index in Elasticsearch.
Additional info:
- each file has circa 1 million records
- this operation will be done each night
- the data is used only for search purposes
Options:
a) I thought about:
Parse the first file and add each record to ES.
Do the same with the next files and update the documents created in the first step (roughly the approach sketched after these options).
This looks very inefficient.
b) another way:
Parse the first file and add it to a relational database.
Do the same with the other files and update the records from the first step.
Propagate the data to ES.
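Roughly what I mean by a), sketched with the Python elasticsearch client (the customers index name, the semicolon delimiter and the CustomerId / Kind columns are just placeholders, not the real export format):

import csv
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def customer_actions():
    # pass 1: create one document per customer
    with open("customer_data.csv", newline="") as f:
        for row in csv.DictReader(f, delimiter=";"):
            yield {"_index": "customers", "_id": row["Id"], "_source": row}

def address_actions():
    # pass 2: partial update of the documents created in pass 1;
    # CustomerId, Kind and "SHIPPING" are placeholder names
    with open("address_data.csv", newline="") as f:
        for row in csv.DictReader(f, delimiter=";"):
            if row.get("Kind") != "SHIPPING":  # keep only the one kind I need
                continue
            yield {"_op_type": "update", "_index": "customers",
                   "_id": row["CustomerId"], "doc": {"address": row}}

helpers.bulk(es, customer_actions())
helpers.bulk(es, address_actions())
# additional_customer_data.csv would be a third pass with the same pattern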
Can you see any other options?
Upvotes: 0
Views: 399
Reputation: 426
I assume you have a normalized relational data structure with 1-to-n relationships in those CSV files, like this:
customer_data.csv
Id;Name;AddressId;AdditionalCustomerDataId;...
0;Mike;2;1;...
address_data.csv
Id;Street;City;...
....
2;Abbey Road;London;...
additional_customer_data.csv
Id;someData;...
...
1;data;...
In that case, I would denormalize those in a preprocessing step into one single CSV and use that to upload them to ES. To avoid downtime, you can then use index aliases. The preprocessing can be done in any language, but loading the CSVs into SQLite tables and joining them there will probably be the fastest; a rough sketch of that follows.
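Something like this in Python, assuming the sample columns above, the official elasticsearch Python client, and that searches go through an alias named customers (the host, index and alias names are placeholders):

import csv
import sqlite3
from datetime import datetime
from elasticsearch import Elasticsearch, helpers

def load_csv(conn, table, path):
    # create a table from the CSV header and bulk-insert all rows
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter=";")
        header = next(reader)
        cols = ", ".join(f'"{c}"' for c in header)
        conn.execute(f"CREATE TABLE {table} ({cols})")
        marks = ", ".join("?" * len(header))
        conn.executemany(f"INSERT INTO {table} VALUES ({marks})", reader)

conn = sqlite3.connect(":memory:")  # or a file if memory gets tight
load_csv(conn, "customer", "customer_data.csv")
load_csv(conn, "address", "address_data.csv")
load_csv(conn, "additional", "additional_customer_data.csv")
conn.execute("CREATE INDEX ix_address ON address(Id)")
conn.execute("CREATE INDEX ix_additional ON additional(Id)")

# denormalize: one flat row per customer
rows = conn.execute("""
    SELECT c.Id, c.Name, a.Street, a.City, x.someData
    FROM customer c
    LEFT JOIN address    a ON a.Id = c.AddressId
    LEFT JOIN additional x ON x.Id = c.AdditionalCustomerDataId
""")

es = Elasticsearch("http://localhost:9200")
new_index = "customers-" + datetime.now().strftime("%Y%m%d%H%M%S")
es.indices.create(index=new_index)

helpers.bulk(es, (
    {"_index": new_index, "_id": cid,
     "_source": {"name": name,
                 "address": {"street": street, "city": city},
                 "someData": some_data}}
    for cid, name, street, city, some_data in rows))

# switch the alias atomically so searches never see a half-built index
actions = [{"add": {"index": new_index, "alias": "customers"}}]
if es.indices.exists_alias(name="customers"):
    for old in list(es.indices.get_alias(name="customers")):
        actions.insert(0, {"remove": {"index": old, "alias": "customers"}})
es.indices.update_aliases(body={"actions": actions})

Every nightly run builds a fresh index and only flips the alias at the end, so the old data stays searchable until the new index is complete.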
I wouldn't choose a strategy that creates just half of each document and adds the additional information later, as every partial update makes Elasticsearch reindex the whole document internally anyway.
However, maybe you can tell us more about the requirements and the external system, because this doesn't seem to be a great solution.
Upvotes: 1