Eswar

Reputation: 1212

Wrong row count for CSV file in python

Before processing a CSV file, I get its row count using the code below.

total_rows=sum(1 for row in open(csv_file,"r",encoding="utf-8"))

I wrote this code with the help given in this link. However, total_rows doesn't match the actual number of rows in the CSV file. I have found an alternative, but I would like to know why this approach does not work correctly.

The CSV file contains cells with huge text, and I have to specify the encoding to avoid errors when reading the file.

Any help is appreciated!

Upvotes: 0

Views: 2487

Answers (2)

Chris

Reputation: 29742

Let's assume you have a CSV file in which some cell contains multi-line text.

$ cat example.csv
colA,colB
1,"Hi. This is Line 1.
And this is Line2"

By the look of it, this file has three lines, and wc -l agrees:

$ wc -l example.csv
3 example.csv

And so does open with sum:

sum(1 for row in open('./example.csv',"r",encoding="utf-8"))
# 3

But now, if you read it with a CSV parser such as pandas.read_csv:

import pandas as pd

df = pd.read_csv('./example.csv')
df
   colA                                    colB
0     1  Hi. This is Line 1.\nAnd this is Line2

An alternative way to get the correct number of rows is to let the csv module parse the file:

import csv

with open(csv_file, "r", encoding="utf-8", newline="") as f:
    reader = csv.reader(f, delimiter=",")
    data = list(reader)
    row_count = len(data)  # includes the header row

Excluding the header, the CSV contains 1 row, which I believe is what you expect. This is because the first cell of colB (the huge text block) is now handled properly: the quotes wrap the entire text, so the newline inside it does not start a new row.
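To make the difference concrete, here is a minimal sketch that recreates the example file in a temporary location and counts it both ways. The temporary file and the variable names are only for illustration; the content mirrors example.csv from above.

```python
import csv
import os
import tempfile

# Same content as example.csv: a header plus one record whose
# quoted field spans two physical lines.
content = 'colA,colB\n1,"Hi. This is Line 1.\nAnd this is Line2"\n'

with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 encoding="utf-8", newline="") as tmp:
    tmp.write(content)
    path = tmp.name

# Counting physical lines: every "\n" ends a line, even inside quotes.
with open(path, "r", encoding="utf-8") as f:
    line_count = sum(1 for _ in f)

# Counting parsed records: the csv module keeps the quoted field together.
with open(path, "r", encoding="utf-8", newline="") as f:
    record_count = sum(1 for _ in csv.reader(f))

os.remove(path)

print(line_count)    # 3
print(record_count)  # 2 (header + 1 data row)
```

Iterating over csv.reader without building a list also avoids holding the whole file in memory, which may matter if the cells are very large.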

Upvotes: 3

wdudzik

Reputation: 1344

I think the problem here is that you are counting newlines (\r\n on Windows or \n on Linux), not rows. This goes wrong when a cell contains text with newline characters, for example:

1, "my huge text\n with many lines\n"
2, "other text"

For the data above, your method will return 4 when there are actually only 2 rows.
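As a rough check of those numbers, the sketch below parses the same two records from an in-memory string with the standard csv module. One assumption to flag: the sample has a space after each comma, and Python's csv reader only treats a quote as opening a quoted field when it is the first character of the field, so skipinitialspace=True is used here to skip that space.

```python
import csv
import io

# The two sample records, with real newlines inside the first quoted field
# and a space after each comma, as in the example above.
text = '1, "my huge text\n with many lines\n"\n2, "other text"\n'

# Physical lines, which is what iterating over an open file counts.
physical_lines = len(text.splitlines())

# skipinitialspace=True lets the parser see the quote as the start of the field.
rows = list(csv.reader(io.StringIO(text, newline=""), skipinitialspace=True))

print(physical_lines)  # 4
print(len(rows))       # 2
```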

Try using pandas or another library to read CSV files. Example:

import pandas as pd
df = pd.read_csv(pathToCsv, sep=',', header=None)
number_of_rows = len(df.index)  # or df[0].count()

Note that len(df.index) and df[0].count() are not interchangeable as count excludes NaNs.

Upvotes: 1
