usr_lal123
usr_lal123

Reputation: 838

Ignoring multiple commas while reading csv in pandas

I m trying to read multiple files whose names start with 'site_%'. Example, file names like site_1, site_a. Each file has data like :

Login_id, Web
1,http://www.x1.com
2,http://www.x1.com,as.php

I need two columns in my pandas df: Login_id and Web.

I am facing error when I try to read records like 2.

df_0 = pd.read_csv('site_1',sep='|')
df_0[['Login_id, Web','URL']] = df_0['Login_id, Web'].str.split(',',expand=True)

I am facing the following error : ValueError: Columns must be same length as key.

Please let me know where I am doing some serious mistake and any good approach to solve the problem. Thanks

Upvotes: 1

Views: 541

Answers (1)

jottbe
jottbe

Reputation: 4521

Solution 1: use split with argument n=1 and expand=True.

result= df['Login_id, Web'].str.split(',', n=1, expand=True)
result.columns= ['Login_id', 'Web']

That results in a dataframe with two columns, so if you have more columns in your dataframe, you need to concat it with your original dataframe (that also applies to the next method).

EDIT Solution 2: there is a nicer regex-based solution which uses a pandas function:

result= df['Login_id, Web'].str.extract('^\s*(?P<Login_id>[^,]*),\s*(?P<URL>.*)', expand=True)

This splits the field and uses the names of the matching groups to create columns with their content. The output is:

  Login_id                       URL
0        1         http://www.x1.com
1        2  http://www.x1.com,as.php

Solution 3: convetional version with regex: You could do something customized, e.g with a regex:

import re
sp_re= re.compile('([^,]*),(.*)')

aux_series= df['Login_id, Web'].map(lambda val: sp_re.match(val).groups())
df['Login_id']= aux_series.str[0]
df['URL']= aux_series.str[1]

The result on your example data is:

                Login_id, Web Login_id                       URL
0         1,http://www.x1.com        1         http://www.x1.com
1  2,http://www.x1.com,as.php        2  http://www.x1.com,as.php

Now you could drop the column 'Login_id, Web'.

Upvotes: 1

Related Questions