Reputation: 229
I have a CSV file with ";" as separator made like this:
col1;col2;col3;col4
4;hello;world;1;1
4;hi;1;1
4;hi;1;1
4;hi;1;1
Obviously by using ";" as sep it gives me the error about tokenizing data (obviously from the header less columns are expected), how can i obtain a dataframe like this:
col1 col2 col3 col4
4 hello;world 1 1
4 hi 1 1
4 hi 1 1
4 hi 1 1
It could be read even with others packages and other data type (even if i prefer pandas because of the following operations in the code)
Upvotes: 3
Views: 8729
Reputation: 46759
You could split off the outer cols until you are left with the remaining col2
. This could be done in Pandas as follows:
import pandas as pd
df_raw = pd.read_csv("input.csv", delimiter=None, header=None, skiprows=1)
df_raw[['col1', 'rest']] = df_raw[0].str.split(";", n=1, expand=True)
df_raw[['col2', 'col3', 'col4']] = df_raw.rest.str.rsplit(";", n=2, expand=True)
df = df_raw[['col1', 'col2', 'col3', 'col4']]
print(df)
Giving df
as:
col1 col2 col3 col4
0 4 hello;world 1 1
1 4 hi 1 1
2 4 hi 1 1
3 4 hi 1 1
First read in the CSV file without using any delimiters to get a single column.
Use a .str.split()
with n=1
to split out just col1
using the ;
delimiter from the left.
Take the remaining rest
and apply .str.rsplit()
with n=2
to do a reverse split using the ;
delimiter to get the remaining columns. This allows col2
to have any have any characters.
This assume that only col2
can have additional ;
separators and the last two are fixed.
Upvotes: 2
Reputation: 1749
import re
pattern = r"(?<=;)(?P<second_column>[\w]+;[\w]+)(?=;)"
with open("bad_csv.csv") as file:
text = file.readlines()
for i in range(len(text)):
if text[i].count(';') == 4:
text[i] = re.sub(pattern, '"' + r"\g<second_column>" + '"', text[i], 1)
with open("good_csv.csv", "w") as file:
for row in text:
file.writelines(row)
df = pd.read_csv("good_csv.csv", sep=';')
Upvotes: 1