Reputation: 381
I have some very basic conceptual questions related to functioning of neo4j. 1. First questions is about import tool. I am importing around 150 million nodes and a similar amount of relationships. When I do an upload the output on command terminal prints the number of nodes uploaded and then prepare node index. What is this node index? Where is it actually used? I see that the created index information is present in the graph_db=>schema=>label. What is this index and where is it actually used? Running a cypher query with does not show that index is being used anywhere. 2. Second questions is about the heap memory size of neo4j. What I understood that while running cypher queries, results are stored in heap. Once the heap is full, a garbage collection happens. What if I run a cypher statement that produces results that can not be kept in heap i.e. the result of query is bigger than the heap size. Would neo4j switch to disk? or would it produce an error. Thanks for clearing these questions in advance. Best,
Upvotes: 1
Views: 91
Reputation: 18022
What is this node index? Where is it actually used?
The index is just that - a database index. A database index is what's used to help you look up nodes really quickly. Say you put 1 million :Person
nodes into a database, then 1 million :Location
nodes in a database. When you MATCH (p:Person { last_name: "Smith" }
you want the database to search through only the :Person
nodes, and not all 2 million. The index is what makes that happen.
What is this index and where is it actually used?
The index by label is basically a searchable collection of nodes categorized by label (in this case :Person
and :Location
) that the database engine uses to speed lookups. This is a greatly simplified answer, but basically accurate. This is a very good thing, you definitely want it. Performance of getting data out of the database would be quite bad without it.
Indexes are all about trading computation time and storage for better performance. Basically, the database pre-orders all of the nodes in a certain way (which costs you up-front computation time, and also a small amount of storage on disk) in exchange for having a nice data structure in place that makes queries very fast. Generally in database terms, you'll find that if you do a lot of read-only queries (fetching data) you really, really want indexes. If your workload is mostly just adding stuff (not lookups), they're not as good.
Running a cypher query with does not show that index is being used anywhere.
Yes, it's invisible, but when you search for something in Cypher using a label, neo4j is exploiting that index. It may be invisible but it's being used to optimize your query.
What I understood that while running cypher queries, results are stored in heap
Well that's only partially true; in some senses everything in java is stored in the heap. But results stream back from the database. If you issue a query that results in 1 million results, it is not the case that all 1 million go into the heap immediately. They get pulled in blocks at a time (I don't know how many at a time, the db engine handles that). At any given time, what's in heap is the set you need right now, not everything.
What if I run a cypher statement that produces results that can not be kept in heap i.e. the result of query is bigger than the heap size
See earlier answer. You can do this without problem, because the entire set generally isn't in the heap. In database terms, we'd say you get a "cursor" back, that lets you iterate through results. You do not get a huge result set back. The gotcha here is that if you have 1million results, you can iterate through them once. Need to run through them a second time? Avoid doing that, or issue the query again.
Would neo4j switch to disk?
No - if/when any swapping to disk happened, in any case that would be an operating system decision dealing with your main memory. It's possible it would happen, but that wouldn't have much to do with neo4j.
or would it produce an error
Nope, neo4j doesn't care how big your result set it. With the "cursor" concept, you can get 1 result or 10 billion results, both will work.
Upvotes: 0