Pandas extracting values from rows based on set of strings

Question

I'm trying to extract specific values(in form of key:value pairs) from a pandas column which has multiple semicolon separated pairs.

The input dataframe looks like this:

9   114188457   114192289   cast_3_930|cast_1_1069|cast_2_985   0.9510007336163186  -   114188457   114188457   211,111,111 "gene_id ""ENSMUSG00000111734""; gene_version ""1""; transcript_id ""ENSMUST00000214237""; transcript_version ""1""; exon_number ""23""; gene_name ""Gm29825""; gene_source ""havana""; gene_biotype ""lincRNA""; havana_gene ""OTTMUSG00000062514""; havana_gene_version ""1""; transcript_name ""Gm29825-201""; transcript_source ""havana""; transcript_biotype ""lincRNA""; havana_transcript ""OTTMUST00000152298""; havana_transcript_version ""1""; exon_id ""ENSMUSE00001401544""; exon_version ""1""; tag ""basic""; transcript_support_level ""5"";"  .
9   114227850   114241851   cast_3_932|cast_1_1071|cast_2_988   1.2516483862692769  +   114227850   114227850   211,111,111 "gene_id ""ENSMUSG00000064299""; gene_version ""6""; transcript_id ""ENSMUST00000213446""; transcript_version ""1""; exon_number ""26""; gene_name ""4921528I07Rik""; gene_source ""ensembl_havana""; gene_biotype ""processed_transcript""; havana_gene ""OTTMUSG00000062515""; havana_gene_version ""1""; transcript_name ""4921528I07Rik-202""; transcript_source ""havana""; transcript_biotype ""lincRNA""; havana_transcript ""OTTMUST00000152299""; havana_transcript_version ""1""; exon_id ""ENSMUSE00001400969""; exon_version ""1""; tag ""basic""; transcript_support_level ""1"";" .
9   114227850   114241851   cast_3_932|cast_1_1071|cast_2_988   1.2516483862692769  +   114227850   114227850   211,111,111 "gene_id ""ENSMUSG00000064299""; gene_version ""6""; transcript_id ""ENSMUST00000213446""; transcript_version ""1""; exon_number ""25""; gene_name ""4921528I07Rik""; gene_source ""ensembl_havana""; gene_biotype ""processed_transcript""; havana_gene ""OTTMUSG00000062515""; havana_gene_version ""1""; transcript_name ""4921528I07Rik-202""; transcript_source ""havana""; transcript_biotype ""lincRNA""; havana_transcript ""OTTMUST00000152299""; havana_transcript_version ""1""; exon_id ""ENSMUSE00001404576""; exon_version ""1""; tag ""basic""; transcript_support_level ""1"";" .

and I'm working on 10th column, which looks like this:

"gene_id ""ENSMUSG00000111734""; gene_version ""1""; transcript_id ""ENSMUST00000214237""; transcript_version ""1""; gene_name ""Gm29825""; gene_source ""havana""; gene_biotype ""lincRNA""; havana_gene ""OTTMUSG00000062514""; havana_gene_version ""1""; transcript_name ""Gm29825-201""; transcript_source ""havana""; transcript_biotype ""lincRNA""; havana_transcript ""OTTMUST00000152298""; havana_transcript_version ""1""; tag ""basic""; transcript_support_level ""5"";"

With pairs in format: identifier ""value""

While I can extract the values by converting that column into another dataframe and selecting the relevant rows, the problems is that the data in that column itself is not sorted properly.

I'm just interested in gene_id, gene_name and gene_biotype in this case, but in future might alter the specifications on the required terms. I could have worked on a dictionary based solution, but the values are not unique for each row, and in some rows they don't exist at all (rows with . for column 10).

Ultimately, I want the dataframe to look like this:

9   114188457   114192289   cast_3_930|cast_1_1069|cast_2_985   0.9510007336163186  -   114188457   114188457   211,111,111 ENSMUSG00000111734  Gm29825 lincRNA .
9   114227850   114241851   cast_3_932|cast_1_1071|cast_2_988   1.2516483862692769  +   114227850   114227850   211,111,111 ENSMUSG00000064299  4921528I07Rik   processed_transcript    .
9   114227850   114241851   cast_3_932|cast_1_1071|cast_2_988   1.2516483862692769  +   114227850   114227850   211,111,111 ENSMUSG00000064299  4921528I07Rik   processed_transcript    .

What would be the most efficient way to do this in pandas ?

michaelg · Accepted Answer

Regex on pandas column

You can use a regex expression after the .str parameter on a column

df['gene_id'] = df.iloc[:,9].str.extract('gene_id \"(\w+)\";')
df['gene_name'] = df.iloc[:,9].str.extract('gene_name \"(\w+)\";')
df['gene_biotype'] =df.iloc[:,9].str.extract('gene_biotype \"(\w+)\";')

Pandas extracting values from rows based on set of strings

Answers (1)

Regex on pandas column

Related Questions