fffrost
fffrost

Reputation: 1767

pandas multiple separator not working

I'm having an issue importing a dataset with multiple separators. The files are mostly tab separated, but there is a single column that has around 700 values that are all semi-colon delimited.

I saw a previous similar question and the solution is simply to specify multiple separators as follows using the 'sep' argument:

dforigin = pd.read_csv(filename, header=0, skiprows=6, 
                       skipfooter=1, sep='\t|;', engine='python')

This does not work for some reason. If I do this it just looks like a mess. Up to this point my workaround has been to import the file as tab-separated, cut out the offending column ('emg data', which is offscreen just to the right of the last column) and save as a temporary .csv, reimport the data and then append it to the initial dataframe.

My workaround feels a bit sloppy and I'm wondering if anybody can help make it a cleaner process.

Upvotes: 1

Views: 865

Answers (1)

andrew_reece
andrew_reece

Reputation: 21274

IIUC, you want the semicolon-delimited values from that one column to each occupy a column in your data frame, alongside the other initial columns from your file. In that case, I'd suggest you read in the file with sep='\t' and then split out the semicolon column afterwards.

With sample data:

data = {'foo':[1,2,3], 'bar':['a;b;c', 'i;j;k', 'x;y;z']}
df = pd.DataFrame(data)
df
     bar  foo
0  a;b;c    1
1  i;j;k    2
2  x;y;z    3

Concat df with a new data frame, constructed of the splitted semicolon column:

pd.concat([df.drop('bar', 1), 
           df.bar.str.split(";", expand=True)], axis=1)

   foo  0  1  2
0    1  a  b  c
1    2  i  j  k
2    3  x  y  z

Note: If your actual data don't include a column name for the semicolon-separated column, but if it's definitely the last column in the table, then per unutbu's suggestion, replace df.bar with df.iloc[:, -1].

Upvotes: 3

Related Questions