Akanshya Bapat

Reputation: 55

Split a string from a column in pd.series format python

I am new to Python and am working through some hands-on exercises.

I am stuck on the following.

I have data in .csv format, which I imported into Python using

import pandas

data = pandas.read_csv("data.csv")
data.head()

   user  rating      id
0     1     3.5  1_1193
1     1     3.5   1_661
2     1     3.5   1_914
3     1     3.5  1_3408
4     1     3.5  1_2355

From the 'id' column, I need to get the number that comes after the '_'.

What I have tried is:

data.id.split('_')

which gave me error: "'DataFrame' object has no attribute 'split'"

So I converted the 'id' column to an np.array, following a solution I found on Stack Overflow.

import numpy as np

s1 = data.id.values
s2 = np.array2string(s1, separator=',',suppress_small=True)
s2.split('_')

This gives me output as:

["['1",
 "1193','1",
 "661','1",
 "914',..., '6040",
 "161','6040",
 "2725','6040",
 "1784']"]

Then:

s2.split('_')[1]

gave me:

"1193','1"

What should I do to get the string after "_" for each row?

Upvotes: 2

Views: 826

Answers (2)

jezrael

Reputation: 863801

You need the vectorized str.split, then select the second element of each resulting list with str[1] (you can also check the pandas docs):

data['a'] = data.id.str.split('_').str[1]
print (data)
   user  rating      id     a
0     1     3.5  1_1193  1193
1     1     3.5   1_661   661
2     1     3.5   1_914   914
3     1     3.5  1_3408  3408
4     1     3.5  1_2355  2355

print (data.dtypes)
user        int64
rating    float64
id         object
a          object <- format is object (obviously string)
dtype: object

# split and cast the column to int
data['a'] = data.id.str.split('_').str[1].astype(int)
print (data)
   user  rating      id     a
0     1     3.5  1_1193  1193
1     1     3.5   1_661   661
2     1     3.5   1_914   914
3     1     3.5  1_3408  3408
4     1     3.5  1_2355  2355

print (data.dtypes)
user        int64
rating    float64
id         object
a           int32 <- format is int
dtype: object
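
To make the chain above easier to follow, here is a minimal sketch that runs each step separately. The frame is rebuilt by hand from the five rows shown in the question (an assumption, since the real data.csv is not available), and the names parts, suffix and suffix_int are only illustrative.

```python
import pandas as pd

# Sample frame rebuilt from the rows shown in the question (stand-in for data.csv)
data = pd.DataFrame({
    "user": [1, 1, 1, 1, 1],
    "rating": [3.5, 3.5, 3.5, 3.5, 3.5],
    "id": ["1_1193", "1_661", "1_914", "1_3408", "1_2355"],
})

# Step 1: .str.split works element-wise; each row becomes a list of pieces
parts = data["id"].str.split("_")

# Step 2: .str[1] indexes into each row's list and picks the second piece
suffix = parts.str[1]

# Step 3: the pieces are still strings, so cast if numbers are needed
suffix_int = suffix.astype(int)

print(suffix_int.tolist())
```

The key point is that .str applies the string method to every element of the Series, which is why calling .split directly on the column fails.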

If you need to replace the id column with the new values:

data.id = data.id.str.split('_').str[1]
print (data)
   user  rating    id
0     1     3.5  1193
1     1     3.5   661
2     1     3.5   914
3     1     3.5  3408
4     1     3.5  2355

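Alternatively, use str.get(1) instead of str[1] (run this on the original id column, not after the replacement above):

data.id = data.id.str.split('_').str.get(1)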
print (data)
   user  rating    id
0     1     3.5  1193
1     1     3.5   661
2     1     3.5   914
3     1     3.5  3408
4     1     3.5  2355
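
One thing that may be worth checking in real data: if some id has no '_', both str[1] and str.get(1) give NaN for that row, and astype(int) would then fail. The sketch below assumes a hypothetical Series ids whose last value lacks the separator, and uses pd.to_numeric with the nullable Int64 dtype as one possible way to keep integers alongside missing values.

```python
import pandas as pd

# Hypothetical ids: the last value has no underscore
ids = pd.Series(["1_1193", "1_661", "914"])

# Rows with no second piece come back as missing values
second = ids.str.split("_").str.get(1)

# Convert to numbers; the nullable Int64 dtype can hold the missing row
as_int = pd.to_numeric(second, errors="coerce").astype("Int64")

print(as_int)
```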

Upvotes: 2

piRSquared

Reputation: 294586

A couple more options...

1. str.extract

df.id.str.extract('.*_(.*)', expand=False)

2. str.replace

df.id.str.replace('.*_', '', regex=True)

Both yield

0    1193
1     661
2     914
3    3408
4    2355
Name: id, dtype: object
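
A self-contained sketch of both regex options, assuming a small df built from the ids in the question. Raw strings are used for the patterns, and regex=True is passed explicitly because recent pandas versions treat the str.replace pattern as a literal string by default.

```python
import pandas as pd

# Stand-in frame using the ids from the question
df = pd.DataFrame({"id": ["1_1193", "1_661", "1_914", "1_3408", "1_2355"]})

# Option 1: capture everything after the last underscore
extracted = df["id"].str.extract(r".*_(.*)", expand=False)

# Option 2: delete everything up to and including the last underscore
replaced = df["id"].str.replace(r".*_", "", regex=True)

print(extracted.tolist())
print(replaced.tolist())
```

Because .* is greedy, both patterns key on the last underscore, which only matters if an id ever contains more than one.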

Upvotes: 1
