Basic / conceptual issues, query performance with Cypher and Neo4J

Question

I'm doing a project on credit card fraud, and I've got some generated sample data in .CSV (pipe delimited) where each line is basically the person's info, the transaction details along with the merchant name, etc.. Since this is generated data, there's also a flag that indicates if this transaction was fraudulent or not.

What I'm attempting to do is to load the data into Neo4j, create nodes (persons, transactions, and merchants), and then visualize a graph of the fraudulent charges to see if there are any common merchants. (I am aware there is a sample neo4j data set similar to this, but I'm attempting to apply this concept to a separate project).

I load the data in, create constraints, and them attempt my query, which seems to run forever.

Here are a few lines of example data..

ssn|cc_num|first|last|gender|street|city|state|zip|lat|long|city_pop|job|dob|acct_num|profile|trans_num|trans_date|trans_time|unix_time|category|amt|is_fraud|merchant|merch_lat|merch_long
692-42-2939|5270441615999263|Eliza|Stokes|F|684 Abigayle Port Suite 372|Tucson|AZ|85718|32.3112|-110.9179|865276|Science writer|1962-12-06|563973647649|40_60_bigger_cities.json|2e5186427c626815e47725e59cb04c9f|2013-03-21|02:01:05|1363831265|misc_net|838.47|1|fraud_Greenfelder, Bartoletti and Davis|31.616203|-110.221915
692-42-2939|5270441615999263|Eliza|Stokes|F|684 Abigayle Port Suite 372|Tucson|AZ|85718|32.3112|-110.9179|865276|Science writer|1962-12-06|563973647649|40_60_bigger_cities.json|7d3f5eae923428c51b6bb396a3b50aab|2013-03-22|22:36:52|1363991812|shopping_net|907.03|1|fraud_Gerlach Inc|32.142740|-111.675048
692-42-2939|5270441615999263|Eliza|Stokes|F|684 Abigayle Port Suite 372|Tucson|AZ|85718|32.3112|-110.9179|865276|Science writer|1962-12-06|563973647649|40_60_bigger_cities.json|76083345f18c5fa4be6e51e4d0ea3580|2013-03-22|16:40:20|1363970420|shopping_pos|912.03|1|fraud_Morissette PLC|31.909227|-111.3878746

The sample file I'm using has about 60k transactions

Below is my cypher query / code thus far.

 USING PERIODIC COMMIT 1000
 LOAD CSV WITH HEADERS FROM "card_data.csv"
 AS line FIELDTERMINATOR '|'


 CREATE (p:Person { id: toInt(line.cc_num), name_first: line.first, name_last: line.last })
 CREATE (m:Merchant { id: line.merchant, name: line.merchant })
 CREATE (t:Transaction { id: line.trans_num, merchant_name: line.merchant, card_number:line.cc_num, amount:line.amt, is_fraud:line.is_fraud, trans_date:line.trans_date, trans_time:line.trans_time })

 create constraint on (t:Transaction) assert t.trans_num is unique;
 create constraint on (p:Person) assert p.cc_num is unique;

 MATCH (m:Merchant)
 WITH m
 MATCH (t:Transaction{merchant_name:m.merchant,is_fraud:1})
 CREATE (m)-[:processed]->(t)

You can see in the 2nd MATCH query, I am attempting to specify that we only examine fraudulent transactions (is_fraud:1), and of the roughly 65k transactions, 230 have is_fraud:1.

Any ideas why this query would seen to run endlessly? I do have MUCH larger sets of data I'd like to examine this way, and the small data results thus far are not promising (I'm sure due to my lack of understanding, not Neo4j's fault).

David Makogon · Accepted Answer

You don't show any index creation. To speed things up, you should create an index on both merchant_name and is_fraud, to avoid going through all transaction nodes sequentially for a given merchant:

CREATE INDEX ON :Transaction(merchant_name)
CREATE INDEX ON :Transaction(is_fraud)

Basic / conceptual issues, query performance with Cypher and Neo4J

Answers (2)

Related Questions