Nitin Kumar

Reputation: 381

Neo4j import tool and querying

I have some very basic conceptual questions about how Neo4j works.

1. The first question is about the import tool. I am importing around 150 million nodes and a similar number of relationships. When I run an import, the terminal output prints the number of nodes imported and then "prepare node index". What is this node index? Where is it actually used? I see that the created index information is present in graph_db=>schema=>label. What is this index and where is it actually used? Running a cypher query with does not show that index is being used anywhere.

2. The second question is about Neo4j's heap memory size. What I understood is that while running Cypher queries, results are stored in the heap. Once the heap is full, a garbage collection happens. What if I run a Cypher statement that produces results that cannot be kept in the heap, i.e. the result of the query is bigger than the heap size? Would Neo4j switch to disk, or would it produce an error?

Thanks in advance for clearing up these questions.

Best,

Upvotes: 1

Views: 91

Answers (1)

FrobberOfBits

Reputation: 18022

What is this node index? Where is it actually used?

The index is just that - a database index. A database index is what helps you look up nodes really quickly. Say you put 1 million :Person nodes into a database, then 1 million :Location nodes. When you run MATCH (p:Person { last_name: "Smith" }), you want the database to search through only the :Person nodes, not all 2 million. The index is what makes that happen.

Read up on indexes in neo4j

What is this index and where is it actually used?

The index by label is essentially a searchable collection of nodes categorized by label (in this case :Person and :Location) that the database engine uses to speed up lookups. This is a greatly simplified answer, but accurate enough. It's a very good thing, and you definitely want it: getting data out of the database would perform quite badly without it.

Indexes are all about trading computation time and storage for better performance. The database pre-orders all of the nodes in a certain way (which costs you up-front computation time, and also a small amount of storage on disk) in exchange for having a nice data structure in place that makes queries very fast. Generally, in database terms, if you do a lot of read-only queries (fetching data) you really, really want indexes. If your workload is mostly just adding data (not lookups), indexes help less, since every write also has to update them.
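As a rough illustration of that trade-off, here is a minimal Python sketch. It is not Neo4j code: the node dictionaries and the `scan` and `build_index` helpers are made up for this example. It builds a lookup table once, paying time and memory up front, and then answers the same question without looking at every node.

```python
from collections import defaultdict

# Hypothetical in-memory "nodes"; a real database keeps these on disk.
nodes = [
    {"id": 0, "label": "Person", "last_name": "Smith"},
    {"id": 1, "label": "Person", "last_name": "Jones"},
    {"id": 2, "label": "Location", "name": "Paris"},
    {"id": 3, "label": "Person", "last_name": "Smith"},
]


def scan(nodes, label, key, value):
    """No index: inspect every node, whatever its label."""
    return [n["id"] for n in nodes if n["label"] == label and n.get(key) == value]


def build_index(nodes, label, key):
    """Pay once up front: map each property value to the ids that carry it."""
    index = defaultdict(list)
    for n in nodes:
        if n["label"] == label and key in n:
            index[n[key]].append(n["id"])
    return index


person_by_last_name = build_index(nodes, "Person", "last_name")

# Same answer either way; the indexed lookup does not touch unrelated nodes.
print(scan(nodes, "Person", "last_name", "Smith"))
print(person_by_last_name["Smith"])
```

With four nodes the difference is invisible, but the scan grows with the total number of nodes while the lookup does not.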

Running a cypher query with does not show that index is being used anywhere.

Yes, it's invisible, but when you search for something in Cypher using a label, Neo4j is exploiting that index to optimize your query.

What I understood is that while running Cypher queries, results are stored in the heap

Well, that's only partially true; in some sense everything in Java is stored in the heap. But results stream back from the database. If you issue a query that produces 1 million results, they do not all go into the heap immediately. They get pulled a block at a time (I don't know how many at a time; the db engine handles that). At any given time, what's in the heap is the set you need right now, not everything.
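To make the streaming idea concrete, here is a small Python sketch of the general pattern. It is only an analogy, not how Neo4j is implemented: `stream_results` and the batch size of 1,000 are assumptions for the example. Rows are produced a block at a time, so memory use is bounded by the block size rather than by the total number of results.

```python
def stream_results(total, batch_size):
    """Yield rows one at a time while holding at most one batch in memory."""
    for start in range(0, total, batch_size):
        # Stand-in for fetching the next block of rows from the database.
        batch = [{"row": i} for i in range(start, min(start + batch_size, total))]
        yield from batch


count = 0
for row in stream_results(total=1_000_000, batch_size=1_000):
    count += 1  # process the row, then let it go
print(count)
```

The loop sees all 1 million rows, yet no more than 1,000 of them exist at any moment.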

What if I run a Cypher statement that produces results that cannot be kept in the heap, i.e. the result of the query is bigger than the heap size

See the earlier answer. You can do this without a problem, because the entire set generally isn't in the heap. In database terms, we'd say you get a "cursor" back that lets you iterate through the results; you do not get one huge result set back. The gotcha here is that if you have 1 million results, you can iterate through them only once. Need to run through them a second time? Avoid doing that, or issue the query again.
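The one-pass behaviour is the same as with any plain iterator. A minimal Python sketch, where `run_query` is a hypothetical stand-in for issuing a query and not a real driver call:

```python
def run_query():
    """Hypothetical stand-in for issuing a query; returns a one-shot iterator."""
    return iter(["row-1", "row-2", "row-3"])


cursor = run_query()
first_pass = list(cursor)   # consumes the cursor
second_pass = list(cursor)  # nothing left to read

print(first_pass)
print(second_pass)

# To read the rows again, issue the query again.
again = list(run_query())
```

The second pass comes back empty, which is why you either keep what you need during the first pass or re-run the query.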

Would neo4j switch to disk?

No. If any swapping to disk happened, that would be an operating system decision about your main memory. It's possible it would happen, but that wouldn't have much to do with Neo4j.

or would it produce an error

Nope, Neo4j doesn't care how big your result set is. With the "cursor" concept, you can get 1 result or 10 billion results; both will work.

Upvotes: 0
