Reputation: 704
Using Python and pandas (imported as pd), I am trying to write an output file that contains a subset of columns, selected by their headers.
Here is how I read the input file:
gene_input = pd.read_table(args.gene, sep="\t", index_col=0)
The structure of gene_input:
       Sample1  Sample2  Sample3  Sample4  Sample5  Sample6  Sample7  Sample8
Gene1  2        23       213      213      13       132      213      4312
Gene2  3        12       21312    123      123      23       4321     432
Gene3  5        213      21312    15       516      3421     4312     4132
Gene4  2        123      123      7        610      23       3214     4312
Gene5  1        213      213      1        152      23       1423     3421
Using a different loop, I generated two dictionaries. The first one has the keys Sample1 and Sample7, and the second has the keys Sample4 and Sample8.
I want the samples from each dictionary to be consecutive in the output, i.e. all of dictionary 1 first, then all of dictionary 2. The output that I am looking for is:
       Sample1  Sample7  Sample4  Sample8
Gene1  2        213      213      4312
Gene2  3        4321     123      432
Gene3  5        4312     15       4132
Gene4  2        3214     7        4312
Gene5  1        1423     1        3421
I have tried the following, but neither attempt worked:
key_num=list(dictionary1.keys())
num = genes_input[gene_input.columns.isin(key_num)]
The idea was to extract the first set of columns and then somehow combine it with the second set, but that failed. It kept giving me attribute errors, even after I updated pandas. I also tried the following:
reader = csv.reader( open(gene_input, 'rU'), delimiter='\t')
header_row = reader.next() # Gets the header
for key, value in numerator.items():
    output.write(key + "\t")
    if key in header_row:
        for row in reader:
            idx=header_row.index(key)
            output.write(idx +"\t")
I also tried some other commands and loops. Depending on the method, sometimes only the first key ends up in the output, and other times I get an error (I am not listing every attempt here, for brevity).
Anyway, if anyone has any input on how I can generate the output file of interest, I'd be grateful.
Again, here is what I want as a final output:
       Sample1  Sample7  Sample4  Sample8
Gene1  2        213      213      4312
Gene2  3        4321     123      432
Gene3  5        4312     15       4132
Gene4  2        3214     7        4312
Gene5  1        1423     1        3421
Upvotes: 1
Views: 8735
Reputation: 3316
For a specific set of columns in a specific order, use:
df = gene_input[['Sample1', 'Sample7', 'Sample4', 'Sample8']]
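If you need to build that list (['Sample1', ...]) automatically, turn each dictionary's keys into a list and concatenate the two lists, dictionary 1 first:
column_names = list(dictionary1.keys()) + list(dictionary2.keys())
df = gene_input[column_names]
The list() calls are needed on Python 3, where dict.keys() returns a view that cannot be added with +. Do not sort the combined list: sorting would give Sample1, Sample4, Sample7, Sample8, which interleaves the two dictionaries instead of keeping each group together. Within each group the columns follow the order of the dictionary's keys, which is insertion order on Python 3.7 and later. For output, you should be able to use: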
df.to_csv(<output file name>, sep='\t')
EDIT: added part about output
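To tie the steps together, here is a minimal end-to-end sketch. It is only an illustration under a few assumptions: the table is rebuilt in memory from the numbers in the question instead of being read from args.gene, the two dictionaries are stand-ins whose values are placeholders (only the keys are used), and the result is written to an in-memory buffer rather than a real file. The check for missing columns is an extra safeguard that was not part of the original suggestion.

```python
import io

import pandas as pd

# Toy copy of the table from the question; a real script would read args.gene.
header = ["", "Sample1", "Sample2", "Sample3", "Sample4",
          "Sample5", "Sample6", "Sample7", "Sample8"]
rows = [
    ["Gene1", 2, 23, 213, 213, 13, 132, 213, 4312],
    ["Gene2", 3, 12, 21312, 123, 123, 23, 4321, 432],
    ["Gene3", 5, 213, 21312, 15, 516, 3421, 4312, 4132],
    ["Gene4", 2, 123, 123, 7, 610, 23, 3214, 4312],
    ["Gene5", 1, 213, 213, 1, 152, 23, 1423, 3421],
]
text = "\n".join("\t".join(str(x) for x in r) for r in [header] + rows)
gene_input = pd.read_csv(io.StringIO(text), sep="\t", index_col=0)

# Hypothetical stand-ins for the two dictionaries; only the keys matter here.
dictionary1 = {"Sample1": None, "Sample7": None}
dictionary2 = {"Sample4": None, "Sample8": None}

# Dictionary 1 keys first, then dictionary 2 keys. No sorting, so each
# group of samples stays together in the output.
column_names = list(dictionary1) + list(dictionary2)

# Selecting by label raises KeyError for unknown names, so report them clearly.
missing = [c for c in column_names if c not in gene_input.columns]
if missing:
    raise KeyError(f"columns not found in input: {missing}")

subset = gene_input[column_names]

# Write tab-separated output; pass a file path instead of the buffer in real use.
buffer = io.StringIO()
subset.to_csv(buffer, sep="\t")
output_text = buffer.getvalue()
print(output_text)
```

The boolean-mask attempt from the question can also be made to work by applying the mask to the column axis, e.g. gene_input.loc[:, gene_input.columns.isin(key_num)], but that keeps the columns in the file's original order rather than in the dictionary order, so selecting with a list of names is the simpler fit here.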
Upvotes: 4