Reputation: 1767
I'm having an issue importing a dataset with multiple separators. The files are mostly tab separated, but there is a single column that has around 700 values that are all semi-colon delimited.
I saw a previous similar question and the solution is simply to specify multiple separators as follows using the 'sep' argument:
dforigin = pd.read_csv(filename, header=0, skiprows=6,
skipfooter=1, sep='\t|;', engine='python')
This does not work for some reason. If I do this it just looks like a mess. Up to this point my workaround has been to import the file as tab-separated, cut out the offending column ('emg data', which is offscreen just to the right of the last column) and save as a temporary .csv, reimport the data and then append it to the initial dataframe.
My workaround feels a bit sloppy and I'm wondering if anybody can help make it a cleaner process.
Upvotes: 1
Views: 865
Reputation: 21274
IIUC, you want the semicolon-delimited values from that one column to each occupy a column in your data frame, alongside the other initial columns from your file. In that case, I'd suggest you read in the file with sep='\t'
and then split out the semicolon column afterwards.
With sample data:
data = {'foo':[1,2,3], 'bar':['a;b;c', 'i;j;k', 'x;y;z']}
df = pd.DataFrame(data)
df
bar foo
0 a;b;c 1
1 i;j;k 2
2 x;y;z 3
Concat df
with a new data frame, constructed of the splitted semicolon column:
pd.concat([df.drop('bar', 1),
df.bar.str.split(";", expand=True)], axis=1)
foo 0 1 2
0 1 a b c
1 2 i j k
2 3 x y z
Note: If your actual data don't include a column name for the semicolon-separated column, but if it's definitely the last column in the table, then per unutbu's suggestion, replace df.bar
with df.iloc[:, -1]
.
Upvotes: 3