Reputation: 6757
So basically I have written a program which generates test data for MongoDB in Node.
For that, the program reads a schema file and generates a specified amount of test data from it. The problem is that this data can eventually become quite big (think of creating 1M users with all the properties they need and 20M chat messages with userFrom and userTo), and the program has to keep all of it in RAM to modify/transform/map it and afterwards save it to a file.
The program works like that:
- Generate the collections and documents according to the schema file
- Resolve every field with a referenceTo to a random object with a matching referenceKey
- Transform everything into a string[] of MongoDB insert statements
- Save that string[] in a file.
This is the structure of the generated test data:
export interface IGeneratedCollection {
dbName: string, // Name of the database
collectionName: string, // Name of the collection
documents: IGeneratedDocument[] // One collection has many documents
}
export interface IGeneratedDocument {
documentFields: IGeneratedField [] // One document has many fields (which are recursive, because of nested documents)
}
export interface IGeneratedField {
fieldName: string, // Name of the property
fieldValue: any, // Value of the property (Can also be IGeneratedField, IGeneratedField[], ...)
fieldNeedsQuotations?: boolean, // If the Value needs to be saved with " ... "
fieldIsObject?: boolean, // If the Value is an object (stored as IGeneratedField[]) (to handle it differently when transforming to MongoDB inserts)
fieldIsJsonObject?: boolean, // If the Value is a plain JSON object
fieldIsArray?: boolean, // If the Value is an array of objects (stored as array of IGeneratedField[])
referenceKey?: number, // Field flagged to be a key
referenceTo?: number // Value gets set to a random object with matching referenceKey
}
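For illustration, a single chat message document expressed in this structure could look roughly like this (the concrete values are made up for this sketch):

// Hypothetical example of one "messages" document in the structure above.
// referenceTo: 1 means "replace this value with a random object that was
// flagged with referenceKey: 1" (e.g. a user id).
const exampleMessage: IGeneratedDocument = {
    documentFields: [
        { fieldName: "message", fieldValue: "Hello!", fieldNeedsQuotations: true },
        { fieldName: "userFrom", fieldValue: null, referenceTo: 1 },
        { fieldName: "userTo", fieldValue: null, referenceTo: 1 }
    ]
};
// And a user's _id field could be flagged as the key being referenced:
const userIdField: IGeneratedField = { fieldName: "_id", fieldValue: "user123", fieldNeedsQuotations: true, referenceKey: 1 };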
So in the example with 1M Users and 20M messages it would look like this:
- 1x IGeneratedCollection (collectionName = "users") with 1M IGeneratedDocument (10 IGeneratedField each)
- 1x IGeneratedCollection (collectionName = "messages") with 20M IGeneratedDocument (message, userFrom, userTo)
which would result in 190M instances of IGeneratedField (1x1Mx10 + 1x20Mx3x = 190M).
This is obviously a lot for the RAM to handle, as it all needs to be stored at the same time.
Temporary Solution
It now works like that:
- Generate 500 documents (rows in SQL) at a time
- JSON.stringify those 500 documents and put them in a SQLite table with the schema (dbName STRING, collectionName STRING, value JSON)
- Remove those 500 documents from JS and let the garbage collector do its thing
- Repeat until all data is generated and in the SQLite table
- Take one of the rows (each containing 500 documents) at a time, apply JSON.parse and search for keys in them
- Repeat until all data is queried and all keys retrieved
- Take one of the rows at a time, apply JSON.parse and search for key references in them
- Apply JSON.stringify and update the row if necessary (if key references were found and resolved)
- Repeat until all data is queried and all keys are resolved
- Take one of the rows at a time, apply JSON.parse and transform the documents to valid SQL/MongoDB inserts
- Add the insert (string) to a SQLite table with the schema (singleInsert STRING)
- Remove the old and now unused row from the SQLite table
- Write all inserts to a file (if run from the command line) or return a dataHandle to query the data in the SQLite table (if run from another Node app)
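For illustration, a minimal sketch of the first step of that pipeline (generate a batch, JSON.stringify it, store it in SQLite), assuming the better-sqlite3 package and a hypothetical generateDocuments() helper standing in for the real schema-driven generator:

// Sketch only: batch generation with SQLite as the overflow store.
// better-sqlite3 is an assumption; generateDocuments() is a hypothetical
// stand-in for the real generator that builds IGeneratedDocument objects.
import Database from "better-sqlite3";

declare function generateDocuments(count: number): IGeneratedDocument[]; // hypothetical

const db = new Database("testdata.sqlite");
db.exec("CREATE TABLE IF NOT EXISTS batches (dbName TEXT, collectionName TEXT, value TEXT)");
const insertBatch = db.prepare("INSERT INTO batches (dbName, collectionName, value) VALUES (?, ?, ?)");

const BATCH_SIZE = 500;
const totalDocuments = 1000000; // e.g. the 1M users

for (let generated = 0; generated < totalDocuments; generated += BATCH_SIZE) {
    const docs = generateDocuments(BATCH_SIZE);
    // Store the whole batch as one JSON string so the objects can leave the JS heap...
    insertBatch.run("testdb", "users", JSON.stringify(docs));
    // ...and let the garbage collector reclaim `docs` after this iteration.
}

With this shape, only one batch of 500 documents is alive in the JS heap at a time; everything else sits in the SQLite file.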
This solution does handle the problem with RAM, because SQLite automatically swaps to the hard drive when the RAM is full.
BUT as you can see, there are a lot of JSON.parse and JSON.stringify calls involved, which slows down the whole process drastically.
What I have thought:
- Maybe I should modify IGeneratedField to use only shortened property names (fieldName -> fn, fieldValue -> fv, fieldIsObject -> fio, fieldIsArray -> fia, ...). This would make the needed storage in the SQLite table smaller, BUT it would also make the code harder to read (see the sketch after this list).
- Use a document-oriented database (but I have not really found a suitable one) to handle the JSON data better.
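As an illustration of the first idea, here is a small sketch of renaming the properties only for storage; the short-name map and the renameKeys() helper are made up for this example, not existing code:

// Sketch: map long property names to short ones right before JSON.stringify
// and back after JSON.parse, so the TypeScript code keeps the readable names.
const LONG_TO_SHORT: Record<string, string> = {
    fieldName: "fn",
    fieldValue: "fv",
    fieldNeedsQuotations: "fnq",
    fieldIsObject: "fio",
    fieldIsJsonObject: "fijo",
    fieldIsArray: "fia",
    referenceKey: "rk",
    referenceTo: "rt"
};
const SHORT_TO_LONG = Object.fromEntries(
    Object.entries(LONG_TO_SHORT).map(([long, short]) => [short, long])
);

// Recursively rename object keys; also handles nested fields and arrays.
function renameKeys(value: any, map: Record<string, string>): any {
    if (Array.isArray(value)) return value.map(v => renameKeys(v, map));
    if (value !== null && typeof value === "object") {
        const renamed: any = {};
        for (const [key, val] of Object.entries(value)) {
            renamed[map[key] ?? key] = renameKeys(val, map);
        }
        return renamed;
    }
    return value;
}

declare const doc: IGeneratedDocument; // some generated document to store
const compact = JSON.stringify(renameKeys(doc, LONG_TO_SHORT));  // before writing to SQLite
const restored = renameKeys(JSON.parse(compact), SHORT_TO_LONG); // after reading it back

This shrinks the JSON stored in SQLite while keeping the readable names in the code, but it does not reduce the number of JSON.parse/JSON.stringify calls.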
Is there any better solution to handle big objects like this in Node?
Is my temporary solution OK? What is bad about it? Can it be changed to perform better?
Upvotes: 6
Views: 4040
Reputation: 953
Conceptually, generate the items as a stream.
You don't need all 1M users in the db at once. You could add 10k at a time.
For the messages, randomly sample 2n users from the db and have them send messages to each other. Repeat until satisfied.
Example:
// Assume Users and Messages are both db collections.
// Assume functions generateUser() and generateMessage(u1, u2) exist.
const _ = require('lodash'); // lodash is needed for _.range and _.chunk

const desiredUsers = 10000;
const desiredMessages = 5000000;
const blockSize = 1000;

(async () => {
  // Insert the users in blocks of `blockSize`.
  for (const i of _.range(desiredUsers / blockSize)) {
    const users = _.range(blockSize).map(generateUser);
    await Users.insertMany(users);
  }
  // For each block of messages, sample 2 * blockSize random users
  // and pair them up as sender and receiver.
  for (const i of _.range(desiredMessages / blockSize)) {
    const users = await Users.aggregate([{ $sample: { size: 2 * blockSize } }]).toArray();
    const messages = _.chunk(users, 2).map((usr) => generateMessage(usr[0], usr[1]));
    await Messages.insertMany(messages);
  }
})();
Depending on how you tweak the stream, you get a different distribution. This is a uniform distribution. You can get a more long-tailed distribution by interleaving the users and messages; you might want that for message boards, for example.
It went to 200MB after I switched the blockSize to 1000.
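For the interleaving variant mentioned above, a rough sketch of one way to do it (same assumptions as the example: Users and Messages are db collections, generateUser() and generateMessage() exist; the round sizes are arbitrary):

// Sketch only (not the answer's code): interleave user creation and message
// creation, so users created early accumulate more messages over time, giving
// a long-tailed distribution.
import _ from "lodash";
declare const Users: any;     // db collection, as in the example above
declare const Messages: any;  // db collection, as in the example above
declare function generateUser(): any;
declare function generateMessage(u1: any, u2: any): any;

(async () => {
    const rounds = 100;
    const usersPerRound = 100;     // grow the user pool a little each round
    const messagesPerRound = 1000; // and send a block of messages each round
    for (const round of _.range(rounds)) {
        await Users.insertMany(_.range(usersPerRound).map(generateUser));
        // Sample senders/receivers from everyone created so far; the earliest
        // rounds have a small pool, so those users end up in many more messages.
        const sampled = await Users.aggregate([{ $sample: { size: 2 * messagesPerRound } }]).toArray();
        const messages = _.chunk(sampled, 2).map((pair: any[]) => generateMessage(pair[0], pair[1]));
        await Messages.insertMany(messages);
    }
})();

Because users from the first rounds can be sampled in every later round, they accumulate more messages than users created late, which produces the long tail.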
Upvotes: 2