ImJa
ImJa

Reputation: 125

how to import csv file to cassandra

I wanted to import csv file to cassandra so, first I created keyspace and columnfamily like this

CREATE COLUMNFAMILY Consumer_complaints(
    Date_received varchar,
    Product varchar,
    Sub_product varchar,
    Issue varchar,
    Sub_issue varchar,
    Consumer_complaint_narrative varchar,
    Company_public_response varchar,
    Company varchar,
    State varchar,
    ZIP_code varint,
    Tags varchar,
    Consumer_consent_provided varchar,
    Submitted_via varchar,
    Date_sent_to_company varchar,
    Company_response_to_consumer varchar,
    Timely_response varchar,
    Consumer_disputed varchar,
    Complaint_ID varint,
    PRIMARY KEY(Complaint_ID)
);

I got a csv file from www.data.gov called consumer complaints and then I typed the command line

COPY consumer_complaints (Date_received,Product,Sub_product, Issue, Sub_issue, Consumer_complaint_narrative, Company_public_response, Company, State, ZIP_code, Tags, Consumer_consent_provided, Submitted_via, Date_sent_to_company, Company_response_to_consumer, Timely_response, Consumer_disputed, Complaint_ID) FROM 'consumer_complaints.csv';

Sample Input

3/21/2017,Credit reporting,,Incorrect information on credit report,Information is not mine,,Company has responded to the consumer and the CFPB and chooses not to provide a public response,EXPERIAN DELAWARE GP,TX,77075,Older American,N/A,Phone,03/21/2017,Closed with non-monetary relief,Yes,No,2397100
04/19/2017,Debt collection,"Other (i.e. phone, health club, etc.)",Disclosure verification of debt,Not disclosed as an attempt to collect,,,"Security Credit Services, LLC",IL,60643,,,Web,04/20/2017,Closed with explanation,Yes,No,2441777

Error

Failed to import 1 rows: ParseError - Failed to parse 797XX : invalid lit for int() with base 10: '797XX',  given up without retries
Failed to import 1 rows: ParseError - Failed to parse 354XX : invalid lit for int() with base 10: '354XX',  given up without retries
Failed to import 2 rows: ParseError - Failed to parse 313XX : invalid lit for int() with base 10: '313XX',  given up without retries
Failed to import 2 rows: ParseError - Failed to parse 054XX : invalid lit for int() with base 10: '054XX',  given up without retries

how can I fix it??

Upvotes: 4

Views: 19970

Answers (2)

Ashraful Islam
Ashraful Islam

Reputation: 12830

Cassandra doesn't preserve the order of column when creating. You need to specify the column name when importing data.

Try this command :

COPY consumer_complaints (Date_received,Product,Sub_product, Issue, Sub_issue, Consumer_complaint_narrative, Company_public_response, Company, State, ZIP_code, Tags, Consumer_consent_provided, Submitted_via, Date_sent_to_company, Company_response_to_consumer, Timely_response, Consumer_disputed, Complaint_ID) FROM 'c.csv' WITH HEADER = true;

Sample Input :

Date_received,Product,Sub_product, Issue, Sub_issue, Consumer_complaint_narrative, Company_public_response, Company, State, ZIP_code, Tags, Consumer_consent_provided, Submitted_via, Date_sent_to_company, Company_response_to_consumer, Timely_response, Consumer_disputed, Complaint_ID
07/26/2013,Mortgage,FHA mortgage,"Loan servicing, payments, escrow account",,,,"CITIBANK, N.A.",NC,28056,,N/A,Web,07/29/2013,Closed with explanation,Yes,No,467750
09/26/2014,Consumer Loan,Vehicle loan,Managing the loan or lease,,,,HSBC NORTH AMERICA HOLDINGS INC.,NY,12572,,N/A,Web,09/26/2014,Closed with explanation,Yes,No,1046323

Output :

complaint_id  | company                          | company_public_response | company_response_to_consumer | consumer_complaint_narrative | consumer_consent_provided | consumer_disputed | date_received | date_sent_to_company | issue                                    | product       | state | sub_issue | sub_product  | submitted_via | tags | timely_response | zip_code
--------------+----------------------------------+-------------------------+------------------------------+------------------------------+---------------------------+-------------------+---------------+----------------------+------------------------------------------+---------------+-------+-----------+--------------+---------------+------+-----------------+----------
      1046323 | HSBC NORTH AMERICA HOLDINGS INC. |                    null |      Closed with explanation |                         null |                       N/A |                No |    09/26/2014 |           09/26/2014 |               Managing the loan or lease | Consumer Loan |    NY |      null | Vehicle loan |           Web | null |             Yes |    12572
       467750 |                   CITIBANK, N.A. |                    null |      Closed with explanation |                         null |                       N/A |                No |    07/26/2013 |           07/29/2013 | Loan servicing, payments, escrow account |      Mortgage |    NC |      null | FHA mortgage |           Web | null |             Yes |    28056

Edited

I checked the data from https://catalog.data.gov/dataset/consumer-complaint-database, some of the zip code has non-integer values like 797XX, 354XX, 313XX and 054XX. You can see that it's clearly not integer. You can either change these value to integer or alter your table and change the type of your ZIP_code field to varchar

Upvotes: 8

Daniel Compton
Daniel Compton

Reputation: 14559

Your error message says:

Failed to import 1 rows: ParseError - Failed to parse 11/08/2013 : invalid literal for int() with base 10: '11/08/2013', given up without retries

It looks like Cassandra is trying to insert the date string as an integer. Without seeing the CSV, I'd guess that the columns aren't in the right order and the date is being parsed as one of your varint fields. If you share a sample of the CSV it might be easier to help debug.

Upvotes: 2

Related Questions