J.Z.
J.Z.

Reputation: 59

Best strategy to process this text file into a structured table?

My collaborator wants me to process the a input text file into a structured table:

The raw input text file looks like

PMID    22224631
Title    -765 G_C and -1195 A_G promoter variants of the cyclooxygenase-2 gene decrease the risk for preeclampsia.
Found 8 gene(s) 
Gene     CRP Mentions
Gene     GC Mentions
Gene     PTGS2 Mentions
Found 1 variant(s)  
Variant  I399V URL
Gene     PTGS1 Mentions
Found 2 variant(s)  
Variant  L255L URL
Variant  V255V URL
Gene     CT49 Mentions
Gene     GAA Mentions
Found 1 variant(s)  
Variant  Q255H URL
Gene     CGA Mentions
Gene     TAT Mentions

PMID    16076618
Title    1166C mutation of angiotensin II type 1 receptor gene is correlated with umbilical blood flow velocimetry in women with preeclampsia.
Found 13 gene(s)    
Gene     AGTR2 Mentions
Gene     QTRT1 Mentions
Gene     SLC25A10 Mentions
Gene     ATM Mentions
Gene     PIH Mentions
Gene     CCL14 Mentions
Gene     AGT Mentions
Gene     REN Mentions
Gene     ASAH1 Mentions
Gene     AGTR1 Mentions
Gene     SSD Mentions
Gene     TAT Mentions
Found 1 variant(s)  
Variant  D389A URL
Gene     ACE Mentions
Found 2 variant(s)  
Variant  D389A URL
Variant  H389P URL

You can see, for each PMID (an id for scientific publication), there is some information about genes, for each gene, there may be some information about variants. The input text pretty much like a "print" function output instead of a table. Then each PMID block is separated by a empty line.

The final table the collaborate wants is like a long format table (.csv) comprised of three layers: PMID, gene and variant. PMID contains genes, the genes contains (or not) variants. Take the example from above input file:

PMID     |   Gene   |  Variant
22224631 |   CRP    | No
22224631 |   GC     | No
22224631 |   PTGS2  | I399V 
22224631 |   PTGS1  | L255L 
22224631 |   PTGS1  | V255V 
22224631 |   CT49   | No 
22224631 |   GAA    | Q255H
.......  |  .....

I do not have much experience processing raw text file to tables in Python.

My thinking is using regex to strip redundant words first. I try to read in this text file, it generate a big list of strings, in which each string is a line in input file

with open ("gene and variants.txt", "r") as myfile:
    data=myfile.readlines()

data2 = [x for x in data if not x.startswith('Title') and not 
x.startswith('Found')]
data3 = [x for x in data2 if x != " \t\n"]
data4 = [x.strip(" Mentions\n") for x in data3]
data4 = [x.strip(" URL") for x in data4]
data4 = [x.replace("Gene\t", "Gene") for x in data4]
data4 = [x.replace("PMID\t", "PMID ") for x in data4]
data4 = [x.replace("Variant\t", "Variant") for x in data4]

Luckily, I am able to strip most unnecessary information, finally get to this list of string:

The list of string like this:

The list of string like this

Then I got stuck.... what to do next to convert this string list to my target table? I was thinking use Pandas, but it seems only take each string as a row in dataframe with a single column.

Am I on the right path? If so, what should I do next?

If not, do you have any suggestion on how should I approach this problem?

Upvotes: 1

Views: 719

Answers (3)

James Dellinger
James Dellinger

Reputation: 1261

You can follow these steps to convert your text file into a Pandas dataframe with the desired format:

  1. Use read_csv() to import the text file. To test out, I copied the raw input text you had pasted in above to a new text file and saved it as raw_input.txt:
df = pd.read_csv('raw_input.txt', header=-1)

The dataframe will contain a bunch of rows formatted like this:

    0
0   PMID 22224631
1   Title -765 G_C and -1195 A_G promoter varia...
2   Found 8 gene(s)
3   Gene CRP Mentions
4   Gene GC Mentions
5   Gene PTGS2 Mentions
6   Found 1 variant(s)
7   Variant I399V URL
8   Gene PTGS1 Mentions
...
  1. Our next step is to create a dictionary that stores the info for each PMID:
# Get the indices of each row that has a new PMID header
pmid_idxs = df[df[0].str.contains('PMID')].index

# Now construct the dictionary, using each PMID as a key and 
# filling the entry for each key with the PMID's gene info.
pmid_dict = {}
for i, val in enumerate(pmid_idxs.values):
    if pmid_idxs.values[-1] != val:
        nxt_pmid_idx = pmid_idxs.values[i+1]
        pmid_dict[df[0].iloc[val]] =  df[0].iloc[val+1:nxt_pmid_idx].reset_index(drop=True)
    else: # if last PMID
        pmid_dict[df[0].iloc[val]] =  df[0].iloc[val+1:].reset_index(drop=True)
  1. Now for the main part -- this is the logic that will loop through each entry in the dictionary, extract and format each PMID's gene info into a small dataframe, and add that dataframe to a list:
df_list = []

for key, value in pmid_dict.items():
    pmid_num = ''.join(c for c in key if c not in 'PMID ')
    series = value
    next_rows = series.shift(-1).fillna('placeholder')
    df_dict = {'PMID': [],
               'Gene': [],
               'Variant': []}
    gene = ''
    variant = ''
    for i, row in series.iteritems():
        if 'Gene' in row:
            gene = row[4:-9].strip(' ')
            if i <= (len(series)) and 'variant' not in next_rows.iloc[i].lower():
                df_dict['PMID'].append(pmid_num)
                df_dict['Gene'].append(gene)
                df_dict['Variant'].append('No')
            elif i == len(series) + 1:
                df_dict['PMID'].append(pmid_num)
                df_dict['Gene'].append(gene)
                df_dict['Variant'].append('No')
        if 'Variant' in row:
            variant = row[8:-4].strip(' ')
            df_dict['PMID'].append(pmid_num)
            df_dict['Gene'].append(gene)
            df_dict['Variant'].append(variant)

    df = pd.DataFrame(df_dict)
    df_list.append(df)
  1. The final output dataframe will merely be a concatenation of each small dataframe we created above:
output_df = pd.concat(df_list).reset_index(drop=True)

And that's it. The output dataframe looks like this, which I believe is your desired format:

    PMID        Gene      Variant
0   22224631    CRP       No
1   22224631    GC        No
2   22224631    PTGS2     I399V
3   22224631    PTGS1     L255L
4   22224631    PTGS1     V255V
5   22224631    CT49      No
6   22224631    GAA       Q255H
7   22224631    CGA       No
8   22224631    TAT       No
9   16076618    AGTR2     No
10  16076618    QTRT1     No
11  16076618    SLC25A10  No
12  16076618    ATM       No
13  16076618    PIH       No
14  16076618    CCL14     No
15  16076618    AGT       No
16  16076618    REN       No
17  16076618    ASAH1     No
18  16076618    AGTR1     No
19  16076618    SSD       No
20  16076618    TAT       D389A
21  16076618    ACE       D389A
22  16076618    ACE       H389P

Upvotes: 1

Al Gebra
Al Gebra

Reputation: 63

You could work with dictionaries.

For example:

fileDict =  {Gene : [], Variant: [], PMID: []}

Iterate through the list and check if Gene, Variant or PMID and append the values.

You can then do like

for x in fileDict['Gene']:
    print(x)

Upvotes: 0

user8287033
user8287033

Reputation:

I am not REALLY experienced in Python, but my approach would be to create tuples.
First one created manually, to make that first PMID | Gene | Variant part,
then using regex to strip unnessesary text and go adding those tuples in a single list.
Then printing them all using String formatting.
Or, you could make 3 lists, one for PMID, one for Gene, one for Variant. Then iterating them with a forloop and printing them out to create that table.
Sorry for not being able to give specific tips.
Best of wishes!

Upvotes: 0

Related Questions