Reputation: 11
I have a dataframe in which each value in one of the columns is a long list of semicolon-separated strings:
gene_id ENSGACG00000019161; gene_version 1; transcript_id ENSGACT00000025386; transcript_version 1; exon_number 9; gene_name slc7a8a; gene_source ensembl; gene_biotype protein_coding; transcript_name slc7a8a-203; transcript_source ensembl; transcript_biotype protein_coding; exon_id ENSGACE00000225405; exon_version 1;
I want to go row by row and pull out just the string that follows gene_name and precedes the semicolon, in this case slc7a8a. I'm sorry if this is either a simple question or a repetitive one. I've looked through multiple resources, but I don't know the most concise way to describe what I want to do, so I had difficulty finding anything helpful.
Thank you
Upvotes: 1
Views: 31
Reputation: 1614
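You can use pandas str.extract, which takes a regex pattern containing a capture group and returns the captured text for each row. The \s+ keeps the space after gene_name out of the captured value:
df['col_name'].str.extract(r'gene_name\s+(.*?);')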
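For a fuller picture, here is a minimal sketch of the same idea on a small made-up frame. The column name `attributes`, the second row (which has no gene_name field), and the new `gene_name` column are assumptions for illustration, not part of the question. The pattern skips the whitespace after `gene_name` so the captured value has no leading space, and `expand=False` is used so the result is a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Hypothetical data: the first row is shortened from the question,
# the second row is invented to show what happens when gene_name is absent.
df = pd.DataFrame({
    "attributes": [
        "gene_id ENSGACG00000019161; gene_version 1; exon_number 9; "
        "gene_name slc7a8a; gene_source ensembl;",
        "gene_id EXAMPLE0001; gene_version 1; gene_source ensembl;",
    ]
})

# \s+ skips the space after the key; ([^;]+) captures everything up to
# the next semicolon. expand=False returns a Series for a single group.
df["gene_name"] = df["attributes"].str.extract(
    r"gene_name\s+([^;]+);", expand=False
)

print(df["gene_name"])
```

Rows where the pattern does not match end up as NaN, so they can be checked afterwards with `df["gene_name"].isna()`.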
Upvotes: 1