DataPlankton

Reputation: 145

Converting pandas column of comma-separated strings into integers

I have a data frame that contains a column of comma-separated values. I would like to convert the string values in that column to integers.

I am newish to coding in general, so a brief explanation of what is happening would be massively appreciated, if you have time.

I have tried the following code.

df['col3'].str.strip(',').astype(int)

Current df:
col1 col2 col3
1    x    12,123
2    x    1,123
3    y    45,998

Desired df:
col1 col2 col3
1    x    12123
2    x    1123
3    y    45998

Upvotes: 8

Views: 6648

Answers (4)

Jay Rajput

Reputation: 1888

All the other answers fix the data after it has been read from a source such as CSV or Excel. Another way to look at the problem is to normalize the data while reading it from the source. Here is how to do that with read_csv or read_excel:

pd.read_csv('your_file_name', thousands=',')
pd.read_excel('your/file/name', thousands=',')

See the pandas documentation for read_excel and read_csv.
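As a minimal sketch of the thousands option, the snippet below reads a small in-memory file instead of a real one. The file contents and the semicolon delimiter are assumptions made for illustration; a semicolon is used so that the commas in col3 can only be read as thousands separators.

```python
import io

import pandas as pd

# Hypothetical file contents standing in for 'your_file_name'.
# Fields are separated by ';' so ',' is unambiguous inside col3.
csv_text = "col1;col2;col3\n1;x;12,123\n2;x;1,123\n3;y;45,998\n"

# thousands=',' tells the parser to drop the commas while parsing numbers,
# so col3 is read as integers rather than strings.
df = pd.read_csv(io.StringIO(csv_text), sep=";", thousands=",")

print(df["col3"].tolist())
```

If the file is comma-delimited, values such as 12,123 would need to be quoted in the file for this to apply.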

Upvotes: 0

Karn Kumar

Reputation: 8816

There are already answers to this question, but I would like to add another solution:

DataFrame:

>>> df
   col1 col2    col3
0     1    x  12,123
1     2    x   1,123
2     3    y  45,998

The simplest approach is the str.replace method:

>>> df['col3'] = df['col3'].str.replace(",", "")
# df['col3'] = df['col3'].str.replace(",", "").astype(int) <- cast to int
>>> df
   col1 col2   col3
0     1    x  12123
1     2    x   1123
2     3    y  45998

OR

Another option is replace with regex=True. The substitution is performed under the hood with re.sub, so the same substitution rules apply.

>>> df['col3'] = df['col3'].replace(',', '', regex=True)
>>> df
   col1 col2   col3
0     1    x  12123
1     2    x   1123
2     3    y  45998
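For completeness, here is a self-contained sketch of the str.replace route including the cast to int. The frame is rebuilt from the question's sample data, and regex=False is an addition here so that the comma is treated as a literal character regardless of the pandas version's default.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "col1": [1, 2, 3],
        "col2": ["x", "x", "y"],
        "col3": ["12,123", "1,123", "45,998"],
    }
)

# Remove every comma, then convert the remaining digit strings to integers.
df["col3"] = df["col3"].str.replace(",", "", regex=False).astype(int)

print(df)
```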

Upvotes: 8

Yuca

Reputation: 6091

Brief explanation:

df['col3'].str.split(',').str.join('').astype(int)
  • df['col3'] selects col3 as a pandas.Series
  • _______.str is the accessor that lets you apply a string method to every element of the Series
  • _____.str.split(',') uses the split method: it breaks each string into substrings, using the separator passed as the parameter to decide where one substring ends and the next one begins
  • _____.str.split(',').str.join('') takes the substrings generated by the split and concatenates them (effectively you're just removing the separator)
  • ____.astype(int) casts each resulting string to an int
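To see what each link in a str.split / str.join / astype chain produces, the steps can be run one at a time. This is a small sketch on the question's sample values; the variable names parts, joined, and result are made up for illustration.

```python
import pandas as pd

s = pd.Series(["12,123", "1,123", "45,998"])

# Step 1: split each string on ',' -> each element becomes a list of substrings.
parts = s.str.split(",")

# Step 2: join each list back together with an empty separator.
joined = parts.str.join("")

# Step 3: cast the digit-only strings to integers.
result = joined.astype(int)

print(parts.tolist())   # [['12', '123'], ['1', '123'], ['45', '998']]
print(joined.tolist())  # ['12123', '1123', '45998']
print(result.tolist())  # [12123, 1123, 45998]
```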

Credit to nixon on including the join to generate the actual desired output. Hope this helps, happy coding!

Upvotes: 2

yatu

Reputation: 88236

I think your solution should actually be:

df['col3'] = df.col3.str.split(',').str.join('').astype(int)

    col1 col2   col3
0     1    x  12123
1     2    x   1123
2     3    y  45998

This is needed because str.strip only removes characters from the left and right ends of each string, so commas in the middle are left in place.
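A quick way to check this behaviour of str.strip is the sketch below. The second value, with leading and trailing commas, is a made-up input added only to show what strip does remove.

```python
import pandas as pd

# "12,123" is from the question; ",1,123," is a hypothetical value with
# commas at both ends to show which commas strip actually removes.
s = pd.Series(["12,123", ",1,123,"])

stripped = s.str.strip(",")
print(stripped.tolist())  # ['12,123', '1,123'] -> inner commas survive

# Because the inner comma is still there, converting to int fails.
try:
    int(stripped[0])
    conversion_failed = False
except ValueError:
    conversion_failed = True

print(conversion_failed)  # True
```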

Explanation

  • str: allows for vectorized string functions on a Series
  • split: splits each element of the Series on a pattern, ',' in this case
  • join: joins the elements of each list in the resulting Series with the passed delimiter, '' here, since you want digit-only strings that can become ints

And finally, .astype(int) turns each string into an integer.

Upvotes: 11
