MStep

Reputation: 11

Python - search for pattern within a DataFrame followed by multiple possible strings

I have a DataFrame in which one of the columns holds a long list of semicolon-separated strings:

gene_id ENSGACG00000019161; gene_version 1; transcript_id ENSGACT00000025386; transcript_version 1; exon_number 9; gene_name slc7a8a; gene_source ensembl; gene_biotype protein_coding; transcript_name slc7a8a-203; transcript_source ensembl; transcript_biotype protein_coding; exon_id ENSGACE00000225405; exon_version 1;

I want to go row by row and pull out just the string that follows gene_name and precedes the semicolon, so in this case slc7a8a. I'm sorry if this is either a simple question or a repetitive one. I've tried to look through multiple resources, but I don't even know the most concise way to describe what I want to do, so I had difficulty finding anything helpful.

Thank you

Upvotes: 1

Views: 31

Answers (1)

panktijk

Reputation: 1614

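You can use pandas' str.extract, which takes a regex pattern containing a capture group and returns the captured text for each row. Matching the whitespace after gene_name outside the group keeps the leading space out of the result, and expand=False returns a Series rather than a one-column DataFrame:

df['col_name'].str.extract(r'gene_name\s+(.*?);', expand=False)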
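For anyone who wants to try this end to end, here is a minimal sketch. The column name `attributes`, the new column `gene_name`, and the second row (which has no gene_name field and uses a made-up gene_id) are assumptions for illustration only; the first row is a shortened copy of the string from the question.

```python
import pandas as pd

# Hypothetical frame: "attributes" is an assumed column name.
# Row 0 is a shortened version of the string in the question;
# row 1 is invented to show what happens when gene_name is absent.
df = pd.DataFrame({
    "attributes": [
        "gene_id ENSGACG00000019161; gene_version 1; exon_number 9; "
        "gene_name slc7a8a; gene_source ensembl; "
        "transcript_name slc7a8a-203;",
        "gene_id EXAMPLE_ID; gene_version 1; gene_source ensembl;",
    ]
})

# \s+ consumes the whitespace after the key so it is not captured;
# ([^;]+) captures everything up to, but not including, the next semicolon.
# expand=False returns a Series, which can be assigned straight to a column.
df["gene_name"] = df["attributes"].str.extract(
    r"gene_name\s+([^;]+);", expand=False
)

print(df["gene_name"])
```

Rows where the pattern does not match come back as NaN, so they can be found afterwards with `df["gene_name"].isna()`.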

Upvotes: 1
