Reputation: 117
I'm doing a project on credit card fraud, and I've got some generated sample data in .CSV (pipe delimited) where each line is basically the person's info, the transaction details along with the merchant name, etc.. Since this is generated data, there's also a flag that indicates if this transaction was fraudulent or not.
What I'm attempting to do is to load the data into Neo4j, create nodes (persons, transactions, and merchants), and then visualize a graph of the fraudulent charges to see if there are any common merchants. (I am aware there is a sample neo4j data set similar to this, but I'm attempting to apply this concept to a separate project).
I load the data in, create constraints, and them attempt my query, which seems to run forever.
Here are a few lines of example data..
ssn|cc_num|first|last|gender|street|city|state|zip|lat|long|city_pop|job|dob|acct_num|profile|trans_num|trans_date|trans_time|unix_time|category|amt|is_fraud|merchant|merch_lat|merch_long
692-42-2939|5270441615999263|Eliza|Stokes|F|684 Abigayle Port Suite 372|Tucson|AZ|85718|32.3112|-110.9179|865276|Science writer|1962-12-06|563973647649|40_60_bigger_cities.json|2e5186427c626815e47725e59cb04c9f|2013-03-21|02:01:05|1363831265|misc_net|838.47|1|fraud_Greenfelder, Bartoletti and Davis|31.616203|-110.221915
692-42-2939|5270441615999263|Eliza|Stokes|F|684 Abigayle Port Suite 372|Tucson|AZ|85718|32.3112|-110.9179|865276|Science writer|1962-12-06|563973647649|40_60_bigger_cities.json|7d3f5eae923428c51b6bb396a3b50aab|2013-03-22|22:36:52|1363991812|shopping_net|907.03|1|fraud_Gerlach Inc|32.142740|-111.675048
692-42-2939|5270441615999263|Eliza|Stokes|F|684 Abigayle Port Suite 372|Tucson|AZ|85718|32.3112|-110.9179|865276|Science writer|1962-12-06|563973647649|40_60_bigger_cities.json|76083345f18c5fa4be6e51e4d0ea3580|2013-03-22|16:40:20|1363970420|shopping_pos|912.03|1|fraud_Morissette PLC|31.909227|-111.3878746
The sample file I'm using has about 60k transactions
Below is my cypher query / code thus far.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "card_data.csv"
AS line FIELDTERMINATOR '|'
CREATE (p:Person { id: toInt(line.cc_num), name_first: line.first, name_last: line.last })
CREATE (m:Merchant { id: line.merchant, name: line.merchant })
CREATE (t:Transaction { id: line.trans_num, merchant_name: line.merchant, card_number:line.cc_num, amount:line.amt, is_fraud:line.is_fraud, trans_date:line.trans_date, trans_time:line.trans_time })
create constraint on (t:Transaction) assert t.trans_num is unique;
create constraint on (p:Person) assert p.cc_num is unique;
MATCH (m:Merchant)
WITH m
MATCH (t:Transaction{merchant_name:m.merchant,is_fraud:1})
CREATE (m)-[:processed]->(t)
You can see in the 2nd MATCH query, I am attempting to specify that we only examine fraudulent transactions (is_fraud:1), and of the roughly 65k transactions, 230 have is_fraud:1.
Any ideas why this query would seen to run endlessly? I do have MUCH larger sets of data I'd like to examine this way, and the small data results thus far are not promising (I'm sure due to my lack of understanding, not Neo4j's fault).
Upvotes: 0
Views: 76
Reputation: 41706
You create duplicate entries both for merchants as well as for people.
// not really needed if you don't merge transactions
// and if you don't look up transactions by trans_num
// create constraint on (t:Transaction) assert t.trans_num is unique;
// can't a person use multiple credit cards?
create constraint on (p:Person) assert p.cc_num is unique;
create constraint on (p:Person) assert p.id is unique;
create constraint on (m:Merchant) assert m.id is unique;
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "card_data.csv" AS line FIELDTERMINATOR '|'
MERGE (p:Person { id: toInt(line.cc_num)})
ON CREATE SET p.name_first=line.first, p.name_last=line.las
MERGE (m:Merchant { id: line.merchant}) ON CREATE SET m.name = line.merchant
CREATE (t:Transaction { id: line.trans_num, card_number:line.cc_num, amount:line.amt, merchant_name: line.merchant,
is_fraud:line.is_fraud, trans_date:line.trans_date, trans_time:line.trans_time })
CREATE (p)-[:issued]->(t)
// only connect fraudulent transactions to the merchant
WHERE t.is_fraud = 1
// also add indicator label to transaction for easier selection / processing later
SET t:Fraudulent
CREATE (m)-[:processed]->(t);
Alternatively you can connect all tx to the merchant and indicate the fraud only via label / alternative rel-types.
Upvotes: 0
Reputation: 71091
You don't show any index creation. To speed things up, you should create an index on both merchant_name
and is_fraud
, to avoid going through all transaction nodes sequentially for a given merchant:
CREATE INDEX ON :Transaction(merchant_name)
CREATE INDEX ON :Transaction(is_fraud)
Upvotes: 0