nameBrandon
nameBrandon

Reputation: 117

Basic / conceptual issues, query performance with Cypher and Neo4J

I'm doing a project on credit card fraud, and I've got some generated sample data in .CSV (pipe delimited) where each line is basically the person's info, the transaction details along with the merchant name, etc.. Since this is generated data, there's also a flag that indicates if this transaction was fraudulent or not.

What I'm attempting to do is to load the data into Neo4j, create nodes (persons, transactions, and merchants), and then visualize a graph of the fraudulent charges to see if there are any common merchants. (I am aware there is a sample neo4j data set similar to this, but I'm attempting to apply this concept to a separate project).

I load the data in, create constraints, and them attempt my query, which seems to run forever.

Here are a few lines of example data..

ssn|cc_num|first|last|gender|street|city|state|zip|lat|long|city_pop|job|dob|acct_num|profile|trans_num|trans_date|trans_time|unix_time|category|amt|is_fraud|merchant|merch_lat|merch_long
692-42-2939|5270441615999263|Eliza|Stokes|F|684 Abigayle Port Suite 372|Tucson|AZ|85718|32.3112|-110.9179|865276|Science writer|1962-12-06|563973647649|40_60_bigger_cities.json|2e5186427c626815e47725e59cb04c9f|2013-03-21|02:01:05|1363831265|misc_net|838.47|1|fraud_Greenfelder, Bartoletti and Davis|31.616203|-110.221915
692-42-2939|5270441615999263|Eliza|Stokes|F|684 Abigayle Port Suite 372|Tucson|AZ|85718|32.3112|-110.9179|865276|Science writer|1962-12-06|563973647649|40_60_bigger_cities.json|7d3f5eae923428c51b6bb396a3b50aab|2013-03-22|22:36:52|1363991812|shopping_net|907.03|1|fraud_Gerlach Inc|32.142740|-111.675048
692-42-2939|5270441615999263|Eliza|Stokes|F|684 Abigayle Port Suite 372|Tucson|AZ|85718|32.3112|-110.9179|865276|Science writer|1962-12-06|563973647649|40_60_bigger_cities.json|76083345f18c5fa4be6e51e4d0ea3580|2013-03-22|16:40:20|1363970420|shopping_pos|912.03|1|fraud_Morissette PLC|31.909227|-111.3878746

The sample file I'm using has about 60k transactions

Below is my cypher query / code thus far.

 USING PERIODIC COMMIT 1000
 LOAD CSV WITH HEADERS FROM "card_data.csv"
 AS line FIELDTERMINATOR '|'


 CREATE (p:Person { id: toInt(line.cc_num), name_first: line.first, name_last: line.last })
 CREATE (m:Merchant { id: line.merchant, name: line.merchant })
 CREATE (t:Transaction { id: line.trans_num, merchant_name: line.merchant, card_number:line.cc_num, amount:line.amt, is_fraud:line.is_fraud, trans_date:line.trans_date, trans_time:line.trans_time })

 create constraint on (t:Transaction) assert t.trans_num is unique;
 create constraint on (p:Person) assert p.cc_num is unique;

 MATCH (m:Merchant)
 WITH m
 MATCH (t:Transaction{merchant_name:m.merchant,is_fraud:1})
 CREATE (m)-[:processed]->(t)

You can see in the 2nd MATCH query, I am attempting to specify that we only examine fraudulent transactions (is_fraud:1), and of the roughly 65k transactions, 230 have is_fraud:1.

Any ideas why this query would seen to run endlessly? I do have MUCH larger sets of data I'd like to examine this way, and the small data results thus far are not promising (I'm sure due to my lack of understanding, not Neo4j's fault).

Upvotes: 0

Views: 76

Answers (2)

Michael Hunger
Michael Hunger

Reputation: 41706

You create duplicate entries both for merchants as well as for people.

// not really needed if you don't merge transactions
// and if you don't look up transactions by trans_num
// create constraint on (t:Transaction) assert t.trans_num is unique;

// can't a person use multiple credit cards?
create constraint on (p:Person) assert p.cc_num is unique;

create constraint on (p:Person) assert p.id is unique;
create constraint on (m:Merchant) assert m.id is unique;




USING PERIODIC COMMIT 1000
 LOAD CSV WITH HEADERS FROM "card_data.csv" AS line FIELDTERMINATOR '|'


MERGE (p:Person { id: toInt(line.cc_num)})
   ON CREATE SET p.name_first=line.first, p.name_last=line.las
MERGE (m:Merchant { id: line.merchant}) ON CREATE SET m.name = line.merchant

CREATE (t:Transaction { id: line.trans_num, card_number:line.cc_num, amount:line.amt, merchant_name: line.merchant,
        is_fraud:line.is_fraud, trans_date:line.trans_date, trans_time:line.trans_time })

CREATE (p)-[:issued]->(t)

// only connect fraudulent transactions to the merchant
WHERE t.is_fraud = 1
// also add indicator label to transaction for easier selection / processing later
SET t:Fraudulent     
CREATE (m)-[:processed]->(t);

Alternatively you can connect all tx to the merchant and indicate the fraud only via label / alternative rel-types.

Upvotes: 0

David Makogon
David Makogon

Reputation: 71091

You don't show any index creation. To speed things up, you should create an index on both merchant_name and is_fraud, to avoid going through all transaction nodes sequentially for a given merchant:

CREATE INDEX ON :Transaction(merchant_name)
CREATE INDEX ON :Transaction(is_fraud)

Upvotes: 0

Related Questions