imba22

Reputation: 649

Not reading all rows while importing csv into pandas dataframe

I am working on a Kaggle challenge and, unfortunately, I am stuck at a very basic step. I am trying to read the dataset into a pandas DataFrame by executing the following command:

test = pd.DataFrame.from_csv("C:/Name/DataMining/hillary/data/output/emails.csv")

The problem is that this file has over 300,000 records, but only 7945 are being read.

print (test.shape)
(7945, 21)

I have double-checked the file and cannot find anything special about line 7945. Any pointers as to why this could be happening?

Upvotes: 8

Views: 22959

Answers (1)

jezrael

Reputation: 862671

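I think it is better to use the read_csv function with the parameters quoting=csv.QUOTE_NONE and error_bad_lines=False. (In pandas 1.3 and later, error_bad_lines=False is written as on_bad_lines="skip"; error_bad_lines was removed in pandas 2.0.)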

import pandas as pd
import csv

test = pd.read_csv("output/Emails.csv", quoting=csv.QUOTE_NONE, error_bad_lines=False)

print(test.shape)
# (381422, 22)

Note that some (problematic) rows will be skipped.
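To see what these two parameters do to the row count, here is a minimal sketch on a small in-memory sample. The sample text is hypothetical and only stands in for Emails.csv; whether the real file contains quoted multi-line fields is an assumption. The sketch uses the on_bad_lines="skip" spelling, so it needs pandas 1.3 or later.

```python
import csv
import io

import pandas as pd

# Hypothetical sample: one quoted field spanning two lines,
# and one line with too many fields (a "bad line").
sample = (
    "Id,Subject,Body\n"
    '1,hi,"first line\n'
    'second line"\n'
    "2,broken,too,many,fields\n"
    "3,bye,last\n"
)

# Default quoting: the quoted, multi-line Body is kept as one record.
quoted = pd.read_csv(io.StringIO(sample), on_bad_lines="skip")

# QUOTE_NONE: quote characters are ordinary text, so every physical
# line becomes its own row; short lines are padded with NaN.
unquoted = pd.read_csv(
    io.StringIO(sample), quoting=csv.QUOTE_NONE, on_bad_lines="skip"
)

print(quoted.shape)
print(unquoted.shape)
```

In both calls the line with five fields is dropped silently. With QUOTE_NONE the multi-line record is additionally split into two rows, so a larger shape does not necessarily mean more records were recovered.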

If you want to skip the email body data, you can use:

import pandas as pd
import csv

test = pd.read_csv(
    "output/Emails.csv",
    quoting=csv.QUOTE_NONE,
    sep=',',
    error_bad_lines=False,
    header=None,
    names=[
        "Id", "DocNumber", "MetadataSubject", "MetadataTo", "MetadataFrom",
        "SenderPersonId", "MetadataDateSent", "MetadataDateReleased",
        "MetadataPdfLink", "MetadataCaseNumber", "MetadataDocumentClass",
        "ExtractedSubject", "ExtractedTo", "ExtractedFrom", "ExtractedCc",
        "ExtractedDateSent", "ExtractedCaseNumber", "ExtractedDocNumber",
        "ExtractedDateReleased", "ExtractedReleaseInPartOrFull",
        "ExtractedBodyText", "RawText"])

print (test.shape)

# drop rows with NaN in the MetadataFrom column
test = test.dropna(subset=['MetadataFrom'])
# drop header rows that were read in as data
test = test[test.MetadataFrom != 'MetadataFrom']
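The effect of the two cleanup steps can be checked on a tiny frame. This is only a sketch: the rows below are made up for illustration and are not taken from Emails.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: a header line read as data, a fragment of body
# text with no sender, and two genuine records.
test = pd.DataFrame({
    "Id": ["Id", "1", "stray body text", "2"],
    "MetadataFrom": ["MetadataFrom", "sender_a", np.nan, "sender_b"],
})

# Drop rows where MetadataFrom is missing (e.g. body-text fragments).
cleaned = test.dropna(subset=["MetadataFrom"])
# Drop the header row that ended up in the data because header=None.
cleaned = cleaned[cleaned["MetadataFrom"] != "MetadataFrom"]

print(cleaned.shape)  # (2, 2)
```

Only the two rows with a real sender remain. Rows whose MetadataFrom happens to be filled by stray text would still pass this filter, so the result is worth spot-checking.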

Upvotes: 18
