Reputation: 59
My collaborator wants me to process the a input text file into a structured table:
The raw input text file looks like
PMID 22224631
Title -765 G_C and -1195 A_G promoter variants of the cyclooxygenase-2 gene decrease the risk for preeclampsia.
Found 8 gene(s)
Gene CRP Mentions
Gene GC Mentions
Gene PTGS2 Mentions
Found 1 variant(s)
Variant I399V URL
Gene PTGS1 Mentions
Found 2 variant(s)
Variant L255L URL
Variant V255V URL
Gene CT49 Mentions
Gene GAA Mentions
Found 1 variant(s)
Variant Q255H URL
Gene CGA Mentions
Gene TAT Mentions
PMID 16076618
Title 1166C mutation of angiotensin II type 1 receptor gene is correlated with umbilical blood flow velocimetry in women with preeclampsia.
Found 13 gene(s)
Gene AGTR2 Mentions
Gene QTRT1 Mentions
Gene SLC25A10 Mentions
Gene ATM Mentions
Gene PIH Mentions
Gene CCL14 Mentions
Gene AGT Mentions
Gene REN Mentions
Gene ASAH1 Mentions
Gene AGTR1 Mentions
Gene SSD Mentions
Gene TAT Mentions
Found 1 variant(s)
Variant D389A URL
Gene ACE Mentions
Found 2 variant(s)
Variant D389A URL
Variant H389P URL
You can see, for each PMID (an id for scientific publication), there is some information about genes, for each gene, there may be some information about variants. The input text pretty much like a "print" function output instead of a table. Then each PMID block is separated by a empty line.
The final table the collaborate wants is like a long format table (.csv) comprised of three layers: PMID, gene and variant. PMID contains genes, the genes contains (or not) variants. Take the example from above input file:
PMID | Gene | Variant
22224631 | CRP | No
22224631 | GC | No
22224631 | PTGS2 | I399V
22224631 | PTGS1 | L255L
22224631 | PTGS1 | V255V
22224631 | CT49 | No
22224631 | GAA | Q255H
....... | .....
I do not have much experience processing raw text file to tables in Python.
My thinking is using regex to strip redundant words first. I try to read in this text file, it generate a big list of strings, in which each string is a line in input file
with open ("gene and variants.txt", "r") as myfile:
data=myfile.readlines()
data2 = [x for x in data if not x.startswith('Title') and not
x.startswith('Found')]
data3 = [x for x in data2 if x != " \t\n"]
data4 = [x.strip(" Mentions\n") for x in data3]
data4 = [x.strip(" URL") for x in data4]
data4 = [x.replace("Gene\t", "Gene") for x in data4]
data4 = [x.replace("PMID\t", "PMID ") for x in data4]
data4 = [x.replace("Variant\t", "Variant") for x in data4]
Luckily, I am able to strip most unnecessary information, finally get to this list of string:
The list of string like this:
Then I got stuck.... what to do next to convert this string list to my target table? I was thinking use Pandas, but it seems only take each string as a row in dataframe with a single column.
Am I on the right path? If so, what should I do next?
If not, do you have any suggestion on how should I approach this problem?
Upvotes: 1
Views: 719
Reputation: 1261
You can follow these steps to convert your text file into a Pandas dataframe with the desired format:
read_csv()
to import the text file. To test out, I copied the raw input text you had pasted in above to a new text file and saved it as raw_input.txt
:df = pd.read_csv('raw_input.txt', header=-1)
The dataframe will contain a bunch of rows formatted like this:
0
0 PMID 22224631
1 Title -765 G_C and -1195 A_G promoter varia...
2 Found 8 gene(s)
3 Gene CRP Mentions
4 Gene GC Mentions
5 Gene PTGS2 Mentions
6 Found 1 variant(s)
7 Variant I399V URL
8 Gene PTGS1 Mentions
...
PMID
:# Get the indices of each row that has a new PMID header
pmid_idxs = df[df[0].str.contains('PMID')].index
# Now construct the dictionary, using each PMID as a key and
# filling the entry for each key with the PMID's gene info.
pmid_dict = {}
for i, val in enumerate(pmid_idxs.values):
if pmid_idxs.values[-1] != val:
nxt_pmid_idx = pmid_idxs.values[i+1]
pmid_dict[df[0].iloc[val]] = df[0].iloc[val+1:nxt_pmid_idx].reset_index(drop=True)
else: # if last PMID
pmid_dict[df[0].iloc[val]] = df[0].iloc[val+1:].reset_index(drop=True)
df_list = []
for key, value in pmid_dict.items():
pmid_num = ''.join(c for c in key if c not in 'PMID ')
series = value
next_rows = series.shift(-1).fillna('placeholder')
df_dict = {'PMID': [],
'Gene': [],
'Variant': []}
gene = ''
variant = ''
for i, row in series.iteritems():
if 'Gene' in row:
gene = row[4:-9].strip(' ')
if i <= (len(series)) and 'variant' not in next_rows.iloc[i].lower():
df_dict['PMID'].append(pmid_num)
df_dict['Gene'].append(gene)
df_dict['Variant'].append('No')
elif i == len(series) + 1:
df_dict['PMID'].append(pmid_num)
df_dict['Gene'].append(gene)
df_dict['Variant'].append('No')
if 'Variant' in row:
variant = row[8:-4].strip(' ')
df_dict['PMID'].append(pmid_num)
df_dict['Gene'].append(gene)
df_dict['Variant'].append(variant)
df = pd.DataFrame(df_dict)
df_list.append(df)
output_df = pd.concat(df_list).reset_index(drop=True)
And that's it. The output dataframe looks like this, which I believe is your desired format:
PMID Gene Variant
0 22224631 CRP No
1 22224631 GC No
2 22224631 PTGS2 I399V
3 22224631 PTGS1 L255L
4 22224631 PTGS1 V255V
5 22224631 CT49 No
6 22224631 GAA Q255H
7 22224631 CGA No
8 22224631 TAT No
9 16076618 AGTR2 No
10 16076618 QTRT1 No
11 16076618 SLC25A10 No
12 16076618 ATM No
13 16076618 PIH No
14 16076618 CCL14 No
15 16076618 AGT No
16 16076618 REN No
17 16076618 ASAH1 No
18 16076618 AGTR1 No
19 16076618 SSD No
20 16076618 TAT D389A
21 16076618 ACE D389A
22 16076618 ACE H389P
Upvotes: 1
Reputation: 63
You could work with dictionaries.
For example:
fileDict = {Gene : [], Variant: [], PMID: []}
Iterate through the list and check if Gene, Variant or PMID and append the values.
You can then do like
for x in fileDict['Gene']:
print(x)
Upvotes: 0
Reputation:
I am not REALLY experienced in Python, but my approach would be to create tuples.
First one created manually, to make that first PMID | Gene | Variant
part,
then using regex to strip unnessesary text and go adding those tuples in a single list.
Then printing them all using String formatting.
Or, you could make 3 lists, one for PMID, one for Gene, one for Variant.
Then iterating them with a forloop and printing them out to create that table.
Sorry for not being able to give specific tips.
Best of wishes!
Upvotes: 0