eb0906

Reputation: 85

pandas split values in column

I'm new to pandas (version 1.1.5). I have tried str.split() and str.extract() to split the POS column, which holds numerical values, without success. My dataframe has about 3000 rows and is structured like this (note the _ and - delimiters in some rows):

df.head()

    SAMPLE  CHROM                  POS  REF  ALT
1  Sample1      7            105121514    C    T
2  Sample2     17              7359940    C    A
3  Sample3      X             76777781    A    G
4  Sample4     16    70531965-70531965    C    G
5  Sample5      6    26093141-26093141    G    A
6  Sample6     12             11905465    C    T
7  Sample7      4  103527484_103527848    G    A

I would like the dataframe to look like this (i.e. keep only the value preceding any delimiter):

    SAMPLE  CHROM        POS  REF  ALT
1  Sample1      7  105121514    C    T
2  Sample2     17    7359940    C    A
3  Sample3      X   76777781    A    G
4  Sample4     16   70531965    C    G
5  Sample5      6   26093141    G    A
6  Sample6     12   11905465    C    T
7  Sample7      4  103527484    G    A

My attempts have either split only the rows containing a delimiter and dropped all other rows, dropped the rows containing delimiters, or dropped all values.

For example, df['POS'] = df['POS'].str.replace(r'[-|_]\d+', '') outputs:

    SAMPLE CHROM    POS REF ALT
1  Sample1     7    NaN   C   T
2  Sample2    17    NaN   C   A
3  Sample3     X    NaN   A   G
4  Sample4    16    NaN   C   G
5  Sample5     6    NaN   G   A
6  Sample6    12    NaN   C   T
7  Sample7     4    NaN   G   A

I'm accepting the solution from @PaulS below. For str.replace() to work, I first needed to convert the values in POS to strings (the column's dtype was already object, so presumably some of its values were not Python strings):

df.dtypes

SAMPLE    object
CHROM     object
POS       object
REF       object
ALT       object
dtype: object

df['POS'] = df['POS'].astype('str')
df['POS'] = df['POS'].str.replace(r'[-_]\d+', '', regex=True)
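
A minimal sketch of why the first attempt could return NaN, assuming POS held a mix of Python ints and strings in one object column. The three-row frame and the names `before` and `after` are illustrative, not the original data:

```python
import pandas as pd

# Illustrative data: ints and strings mixed in one object column
# (an assumption about how the original POS column was loaded).
df = pd.DataFrame({"POS": [105121514, "70531965-70531965", "103527484_103527848"]})

# The .str accessor yields NaN for elements that are not strings.
before = df["POS"].str.replace(r"[-_]\d+", "", regex=True)

# Converting every element to str first lets the replacement reach all rows.
after = df["POS"].astype(str).str.replace(r"[-_]\d+", "", regex=True)

print(before.tolist())
print(after.tolist())
```

In `before`, the integer row comes back as NaN while the delimited rows are trimmed; in `after`, every row keeps its leading number as a string.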

Upvotes: 0

Views: 1828

Answers (3)

PaulS

Reputation: 25323

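A possible solution: replace the delimiter (_ or -) together with the digits that follow it with the empty string (''). Passing regex=True is explicit and also required on pandas 2.0 and later, where str.replace() defaults to regex=False:

df['POS'] = df['POS'].str.replace(r'[-_]\d+', '', regex=True)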

Output:

  CHROM        POS REF ALT
0     7  105121514   C   T
1    17    7359940   C   A
2     X   76777781   A   G
3    16   70531965   C   G
4     6   26093141   G   A
5    12   11905465   C   T
6     4  103527484   G   A
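
To see what the pattern matches outside pandas, here is a small sketch using the standard-library `re` module. The sample values are copied from the question; `values`, `cleaned`, and `pipe_result` are illustrative names:

```python
import re

values = ["105121514", "70531965-70531965", "103527484_103527848"]

# [-_] matches one hyphen or underscore; \d+ matches the digits that follow it.
pattern = re.compile(r"[-_]\d+")
cleaned = [pattern.sub("", v) for v in values]
print(cleaned)

# Inside a character class, | is a literal pipe rather than alternation,
# so the question's [-|_] would also strip a pipe-delimited suffix.
pipe_result = re.sub(r"[-|_]\d+", "", "123|456")
print(pipe_result)
```

Values without a delimiter pass through unchanged, which is why the same replacement can be applied to the whole column.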

Upvotes: 0

scotscotmcc

Reputation: 3113

If you are on pandas >= 1.4, you can pass a regex to str.split() with regex=True. Combine this with expand=True and take the first resulting column:

df['POS'] = df['POS'].str.split(r'[-_]', expand=True, regex=True)[0]
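
A short sketch of the split approach on illustrative data, assuming pandas >= 1.4 for the `regex` keyword. The `n=1` argument and the final conversion to integers are additions for illustration and are not part of the answer above:

```python
import pandas as pd

# Illustrative frame using values from the question.
df = pd.DataFrame({"POS": ["105121514", "70531965-70531965", "103527484_103527848"]})

# n=1 stops after the first delimiter; column 0 holds the leading number.
first = df["POS"].astype(str).str.split(r"[-_]", n=1, expand=True, regex=True)[0]

# Optional: once only digits remain, store the positions as integers.
df["POS"] = first.astype(int)
print(df)
```

Converting to integers is only safe if every value starts with digits; a missing or malformed entry would make `astype(int)` raise.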

Upvotes: 0

Mato

Reputation: 54

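This is not the most common approach, but you can try it: replace both delimiters with a space, split on whitespace, and keep the first piece.

# Turn both delimiters into spaces (regex=False treats them as literal text)
df.POS = df.POS.str.replace("-", " ", regex=False)
df.POS = df.POS.str.replace("_", " ", regex=False)
# Split on whitespace and keep the first piece of each value
df.POS = df.POS.str.split().str[0]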

Upvotes: 1
