Reputation: 153
This is my first time writing a Python script. I have read through many questions and answers on Stack Overflow, but I still can't figure out where my code goes wrong. Could I ask for help?
I have a file, file.txt, shown below, with a varying number of columns in each line. "\t" is the tab delimiter.
$ head file.txt
AA:d23\tBB:4r3w\tCC:e5t
BB:435\tCC:w4w
AA:w4r\tCC:2342
AA:34534\tBB:e5\tCC:7uf
BB:e4t4
I would like to turn it into a data-frame-like .txt file with three columns in every row, adding NA for any missing column. I would also like to drop the rows that have only one entry (e.g. only AA, only BB, or only CC). The expected output is:
AA:d23\tBB:4r3w\tCC:e5t
AA:NA\tBB:435\tCC:w4w
AA:w4r\tBB:NA\tCC:2342
AA:34534\tBB:e5\tCC:7uf
#(the 5th line is omitted here because it only has one entry)
After studying many examples on the forum, I imitated some of the code and wrote my own, as below:
#!/usr/bin/env python3
#into_data.py
import csv
import fileinput
def into_data(a_file):
    output=[]
    for row in a_file:
        if "AA:" in row and "BB:" in row and "CC:" in row:
            output.append(row)
        elif "AA:" not in row and "BB:" in row and "CC:" in row:
            output.append("AA:NA" + row)
        elif "AA:" in row and "BB:" not in row and "CC:" in row:
            output.append(row.split("\t")[0] + "CB:NA" + row.split("\t")[1])
        elif "AA:" in row and "BB:" in row and "CC:" not in row:
            output.append(row + "CC:NA")
reader = csv.reader(fileinput.input(), delimiter="\t")
print(into_data(reader))
#outside python script
python3 into_data.py file.txt > output.txt
But I get "None" in my output.txt, and I don't really understand why. Could you please point out my error? Thanks a lot in advance!
Upvotes: 0
Views: 168
Reputation: 37877
Since you mention dataframes/pandas, here is one approach:
import pandas as pd
import numpy as np
df = pd.read_csv('test.txt', header=None, sep='\t', engine='python')  # split on real tab characters
# does the row have a single entry?
m = df.notnull().sum(axis=1).eq(1)
df = df.loc[~m]
out = (df.stack().reset_index(name='val')
         .assign(col=lambda x: x['val'].str.slice(0,2))
         .pivot(index='level_0', columns='col', values='val')
         .reset_index(drop=True)
      )
out[:] = np.where(out.isna(), [out.columns + ':NA'], out)
out.columns = df.columns
out.to_csv('final.txt', sep='\t', header=None, index=False)
print(out)
          0        1        2
0    AA:d23  BB:4r3w   CC:e5t
1     AA:NA   BB:435   CC:w4w
2    AA:w4r    BB:NA  CC:2342
3  AA:34534    BB:e5   CC:7uf
Upvotes: 1
Reputation: 13533
Here's a pretty straightforward way that should handle all cases.
# a_file is a csv.reader
def into_data(a_file):
    output = []
    for row in a_file:
        if len(row) <= 1:
            continue
        if not row[0].startswith("AA"):
            row.insert(0, "AA:NA")
        if not row[1].startswith("BB"):
            row.insert(1, "BB:NA")
        if len(row) < 3:
            row.append("CC:NA")
        output.append(row)
    return output
This creates a list of lists. If you want a list of strings, change the line output.append(row) to:
output.append('\t'.join(row))
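For completeness, here is a rough sketch of how the function above might be wired up end to end. It is only an illustration: it reads the sample from the question out of an in-memory string (io.StringIO) rather than a real file, and the names sample, rows, buffer and result_text are made up for the example.

```python
import csv
import io

def into_data(a_file):
    # a_file is a csv.reader; each row is a list of fields
    output = []
    for row in a_file:
        if len(row) <= 1:
            continue  # skip rows with a single entry (or none)
        if not row[0].startswith("AA"):
            row.insert(0, "AA:NA")
        if not row[1].startswith("BB"):
            row.insert(1, "BB:NA")
        if len(row) < 3:
            row.append("CC:NA")
        output.append(row)
    return output

# Sample data from the question, held in memory instead of file.txt (assumption)
sample = (
    "AA:d23\tBB:4r3w\tCC:e5t\n"
    "BB:435\tCC:w4w\n"
    "AA:w4r\tCC:2342\n"
    "AA:34534\tBB:e5\tCC:7uf\n"
    "BB:e4t4\n"
)

reader = csv.reader(io.StringIO(sample), delimiter="\t")
rows = into_data(reader)

# Write the rows back out as tab-separated text
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(rows)
result_text = buffer.getvalue()
print(result_text, end="")
```

With a real file you would presumably swap the first StringIO for open("file.txt", newline="") and write to sys.stdout or an output file instead of the buffer.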
Upvotes: 1
Reputation: 96
Your function doesn't have a return statement, so it defaults to returning None. You need to return output at the end of the function, like this:
def into_data(a_file):
    output=[]
    for row in a_file:
        if "AA:" in row and "BB:" in row and "CC:" in row:
            output.append(row)
        elif "AA:" not in row and "BB:" in row and "CC:" in row:
            output.append("AA:NA" + row)
        elif "AA:" in row and "BB:" not in row and "CC:" in row:
            output.append(row.split("\t")[0] + "CB:NA" + row.split("\t")[1])
        elif "AA:" in row and "BB:" in row and "CC:" not in row:
            output.append(row + "CC:NA")
    return output
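To see the missing-return behaviour in isolation, here is a tiny sketch with two made-up functions (without_return and with_return are illustrative names, not part of your script). They do the same work, but only one hands its list back to the caller.

```python
def without_return(items):
    collected = []
    for item in items:
        collected.append(item)
    # no return statement: the call evaluates to None

def with_return(items):
    collected = []
    for item in items:
        collected.append(item)
    return collected

no_result = without_return(["a", "b"])
result = with_return(["a", "b"])

print(no_result)  # None
print(result)     # ['a', 'b']
```

This is why print(into_data(reader)) wrote "None" to output.txt: print received the function's implicit return value, not the list built inside it.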
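Note that csv.reader yields each row as a list of fields, so a test such as "AA:" in row looks for a field exactly equal to "AA:" and never matches; the function above therefore returns an empty list. To edit the rows as you specify, you can try this: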
def into_data(a_file):
    output = []
    for row in a_file:
        if len(row) <= 1:
            continue
        if "AA:" not in row[0]:
            row.insert(0, "AA:NA")
        if "BB:" not in row[1]:
            row.insert(1, "BB:NA")
        if len(row) < 3:
            row.insert(2, "CC:NA")
        output.append(row)
    return output
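If the fields might not always arrive in AA, BB, CC order, one possible variation (my own assumption, not something the question requires) is to key each field by its prefix and rebuild the row from a fixed list of keys. KEYS and normalize_row are hypothetical names used only for this sketch.

```python
KEYS = ("AA", "BB", "CC")  # assumed fixed set of column keys

def normalize_row(fields, keys=KEYS):
    """Return a full row with NA for missing keys, or None if at most one key is present."""
    found = {}
    for field in fields:
        key, sep, value = field.partition(":")
        if sep and key in keys:
            found[key] = value
    if len(found) <= 1:
        return None  # rows with a single entry are dropped
    return [f"{key}:{found.get(key, 'NA')}" for key in keys]

print(normalize_row(["BB:435", "CC:w4w"]))   # ['AA:NA', 'BB:435', 'CC:w4w']
print(normalize_row(["AA:w4r", "CC:2342"]))  # ['AA:w4r', 'BB:NA', 'CC:2342']
print(normalize_row(["BB:e4t4"]))            # None
```

Inside the loop you would call normalize_row(row) and append the result only when it is not None.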
Upvotes: 1