Reputation: 6260
Following up on an old question of mine: I finally identified what happens.
I have a CSV file that uses \t as the separator,
and I am reading it with the following command:
df = pd.read_csv(r'C:\..\file.csv', sep='\t', encoding='unicode_escape')
The resulting length is, for example, 800,000.
The problem is that the original file has around 1,400,000 lines. I also know where the issue occurs: one column (let's say columnA) has the following entry:
"HILFE FüR DIE Alten
Do you have any idea what is happening? When I delete that row, I get the correct number of lines (length). What is Python doing here?
Upvotes: 2
Views: 625
Reputation: 164
According to the pandas documentation (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html):
sep : str, default ‘,’ Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'.
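It may be an issue with the double-quote character: an unmatched opening quote makes the parser treat everything up to the next quote, line breaks included, as one field, so many lines collapse into a single row. Try this instead: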
df = pd.read_csv(r'C:\..\file.csv', sep='\\t', encoding='unicode_escape', engine='python')
or this:
df = pd.read_csv(r'C:\..\file.csv', sep=r'\t', encoding='unicode_escape')
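If the quotes in the file carry no meaning, another option worth trying (an assumption on my side, since I have not seen the real file) is to switch quote handling off with quoting=csv.QUOTE_NONE. Below is a minimal sketch on made-up sample data; the column names and rows are hypothetical and only mimic the stray quote from the question, with a second quote a few lines later that closes it:

```python
import csv
import io

import pandas as pd

# Hypothetical tab-separated sample: row 2 opens a quote that is only
# closed by the stray quote at the end of row 4.
raw = (
    "id\tcolumnA\n"
    "1\tfoo\n"
    '2\t"HILFE FüR DIE Alten\n'
    "3\tbar\n"
    '4\tbaz"\n'
    "5\tqux\n"
)

# Default quoting: the text between the two quotes, including the line
# breaks, is read as one field, so rows are swallowed.
df_default = pd.read_csv(io.StringIO(raw), sep="\t")

# QUOTE_NONE: the double quote is treated as an ordinary character,
# so every physical line becomes its own row.
df_noquote = pd.read_csv(io.StringIO(raw), sep="\t", quoting=csv.QUOTE_NONE)

print(len(df_default), len(df_noquote))
```

With QUOTE_NONE the quote characters stay in the data, so they may need to be stripped afterwards, and this only works if no field legitimately contains a tab or a line break inside quotes.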
Upvotes: 1