Reputation: 289
I am trying to import a very large CSV file (well over 25M rows) into Python using a pandas DataFrame.
The dataframe has the following columns: dest_profile, first_name, last_name, id, con, company_name.
Sometimes there is a '\' within company_name (example: HPE\HPI), and it causes an import error. I've added error_bad_lines=False to my pd.read_csv call. However, I want to import those rows as well.
How do I go about handling the \ within the company_name column so those rows are imported too?
import pandas as pd
import numpy as np
df_1st_conns = pd.read_csv(r"D:\Downloads\LinkedIn\DataV2\1st_degree_nbrs.csv", error_bad_lines=False)
It thinks \ is a column delimiter. Here is the error message.
b'Skipping line 22813: expected 6 fields, saw 7\nSkipping line 62807: expected 6 fields, saw 7\n'
b'Skipping line 152688: expected 6 fields, saw 7\nSkipping line 170013: expected 6 fields, saw 7\nSkipping line 222565: expected 6 fields, saw 7\nSkipping line 222644: expected 6 fields, saw 7\nSkipping line 240790: expected 6 fields, saw 7\n'
Upvotes: 1
Views: 1041
Reputation: 5609
Perhaps you could create a new file that has all backslashes replaced with an empty string "" (or some other replacement character).
An example snippet:
input_csv_filename = "original.csv"
output_csv_filename = "no_backslashes.csv"
# Read original contents
with open(input_csv_filename, 'rb') as f:
csv_contents = f.read()
# Replace backslash with empty string
# b'\\' is the two-character bytes literal for a single backslash byte
csv_contents = csv_contents.replace(b'\\', b'')
# Write replaced contents to the output csv file
with open(output_csv_filename, 'wb') as f:
f.write(csv_contents)
You can then go on to read the output csv file with your code:
import pandas as pd
df = pd.read_csv(output_csv_filename)
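Since the file is well over 25M rows, reading it all into memory with f.read() may be expensive. The same replacement can be done in fixed-size chunks; a single-byte pattern like b'\\' cannot straddle a chunk boundary, so chunking is safe here. A sketch (filenames are placeholders):

```python
# Stream the file in fixed-size chunks so the whole CSV never has
# to fit in memory at once. Safe for a single-byte pattern like
# b'\\', which cannot be split across a chunk boundary.
CHUNK_SIZE = 1024 * 1024  # 1 MiB per read

def strip_backslashes(input_filename, output_filename):
    with open(input_filename, 'rb') as src, \
         open(output_filename, 'wb') as dst:
        while True:
            chunk = src.read(CHUNK_SIZE)
            if not chunk:
                break
            # Replace every backslash byte with nothing
            dst.write(chunk.replace(b'\\', b''))
```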
Edit - 1: Beware that this will indiscriminately replace all backslashes in your original csv file. If you're confident that there wouldn't be backslashes anywhere else, then you can use this approach.
Edit - 2: My bad, I initially assumed that the file would not contain unicode characters. I have changed my code to deal with the file in bytes.
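Edit - 3: As an aside, error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; on newer versions the equivalent is on_bad_lines="skip". A toy illustration with an in-memory CSV:

```python
import io

import pandas as pd

# Toy CSV with one malformed row (4 fields instead of 3)
data = io.StringIO("a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n")

# on_bad_lines="skip" is the pandas >= 1.3 replacement for
# error_bad_lines=False: malformed rows are silently dropped
df = pd.read_csv(data, on_bad_lines="skip")
print(len(df))  # 2 — the malformed row was skipped
```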
Upvotes: 1