Reputation: 373
I'm trying to extract specific values(in form of key:value pairs) from a pandas column which has multiple semicolon separated pairs.
The input dataframe looks like this:
9 114188457 114192289 cast_3_930|cast_1_1069|cast_2_985 0.9510007336163186 - 114188457 114188457 211,111,111 "gene_id ""ENSMUSG00000111734""; gene_version ""1""; transcript_id ""ENSMUST00000214237""; transcript_version ""1""; exon_number ""23""; gene_name ""Gm29825""; gene_source ""havana""; gene_biotype ""lincRNA""; havana_gene ""OTTMUSG00000062514""; havana_gene_version ""1""; transcript_name ""Gm29825-201""; transcript_source ""havana""; transcript_biotype ""lincRNA""; havana_transcript ""OTTMUST00000152298""; havana_transcript_version ""1""; exon_id ""ENSMUSE00001401544""; exon_version ""1""; tag ""basic""; transcript_support_level ""5"";" .
9 114227850 114241851 cast_3_932|cast_1_1071|cast_2_988 1.2516483862692769 + 114227850 114227850 211,111,111 "gene_id ""ENSMUSG00000064299""; gene_version ""6""; transcript_id ""ENSMUST00000213446""; transcript_version ""1""; exon_number ""26""; gene_name ""4921528I07Rik""; gene_source ""ensembl_havana""; gene_biotype ""processed_transcript""; havana_gene ""OTTMUSG00000062515""; havana_gene_version ""1""; transcript_name ""4921528I07Rik-202""; transcript_source ""havana""; transcript_biotype ""lincRNA""; havana_transcript ""OTTMUST00000152299""; havana_transcript_version ""1""; exon_id ""ENSMUSE00001400969""; exon_version ""1""; tag ""basic""; transcript_support_level ""1"";" .
9 114227850 114241851 cast_3_932|cast_1_1071|cast_2_988 1.2516483862692769 + 114227850 114227850 211,111,111 "gene_id ""ENSMUSG00000064299""; gene_version ""6""; transcript_id ""ENSMUST00000213446""; transcript_version ""1""; exon_number ""25""; gene_name ""4921528I07Rik""; gene_source ""ensembl_havana""; gene_biotype ""processed_transcript""; havana_gene ""OTTMUSG00000062515""; havana_gene_version ""1""; transcript_name ""4921528I07Rik-202""; transcript_source ""havana""; transcript_biotype ""lincRNA""; havana_transcript ""OTTMUST00000152299""; havana_transcript_version ""1""; exon_id ""ENSMUSE00001404576""; exon_version ""1""; tag ""basic""; transcript_support_level ""1"";" .
and I'm working on 10th column, which looks like this:
"gene_id ""ENSMUSG00000111734""; gene_version ""1""; transcript_id ""ENSMUST00000214237""; transcript_version ""1""; gene_name ""Gm29825""; gene_source ""havana""; gene_biotype ""lincRNA""; havana_gene ""OTTMUSG00000062514""; havana_gene_version ""1""; transcript_name ""Gm29825-201""; transcript_source ""havana""; transcript_biotype ""lincRNA""; havana_transcript ""OTTMUST00000152298""; havana_transcript_version ""1""; tag ""basic""; transcript_support_level ""5"";"
With pairs in format: identifier ""value""
While I can extract the values by converting that column into another dataframe and selecting the relevant rows, the problems is that the data in that column itself is not sorted properly.
I'm just interested in gene_id
, gene_name
and gene_biotype
in this case, but in future might alter the specifications on the required terms.
I could have worked on a dictionary based solution, but the values are not unique for each row, and in some rows they don't exist at all (rows with .
for column 10).
Ultimately, I want the dataframe to look like this:
9 114188457 114192289 cast_3_930|cast_1_1069|cast_2_985 0.9510007336163186 - 114188457 114188457 211,111,111 ENSMUSG00000111734 Gm29825 lincRNA .
9 114227850 114241851 cast_3_932|cast_1_1071|cast_2_988 1.2516483862692769 + 114227850 114227850 211,111,111 ENSMUSG00000064299 4921528I07Rik processed_transcript .
9 114227850 114241851 cast_3_932|cast_1_1071|cast_2_988 1.2516483862692769 + 114227850 114227850 211,111,111 ENSMUSG00000064299 4921528I07Rik processed_transcript .
What would be the most efficient way to do this in pandas ?
Upvotes: 0
Views: 60
Reputation: 954
You can use a regex expression after the .str
parameter on a column
df['gene_id'] = df.iloc[:,9].str.extract('gene_id \"(\w+)\";')
df['gene_name'] = df.iloc[:,9].str.extract('gene_name \"(\w+)\";')
df['gene_biotype'] =df.iloc[:,9].str.extract('gene_biotype \"(\w+)\";')
Upvotes: 1