Reputation: 173
I am trying to put specific columns of tab separated files into a dictionary. I am trying several things and none give me the result I am looking for.
I have for example this file:
Name Start End Size
del1 100 105 5
del2 150 160 10
del3 250 300 50
and this file, both .csv
Name Qual StartB EndB Size
inv1 6 400 405 5
inv2 7 450 460 10
inv3 20 450 400 50
What I want is something like this, where Name is the key and the others are values, additionally I have the problem of changing headers and indexes of headers, but they mean the same thing:
del_dict{del1: {Start: 100, End: 105, Size:5} del2: {etc}
I tried reading the file in several ways, based on other stack overflow answers.
for file in glob.glob(directoryname + "/*.csv"):
dict = pd.read_csv(file, squeeze=False, sep="\t").to_dict()
print(dict)
and
for file in glob.glob(directoryname + "/*.csv"):
df = pd.read_csv(open(file, 'r'), header=0, sep="\t")
if "StartB" in df.keys():
name = df.Name
start_pos = df.StartB
end_pos = df.EndB
else:
name = df.Name
start_pos = df.Start
end_pos = df.End
But this gives me dataframes, that I cannot seem to fit in that into a dictionary.
I also tried this code, which I used before, but then it was only one file and no changing headers and then it will result in too many loops and hard coding to digest everything I need, based on the file I open.
for file in glob.glob(directoryname + "/*.csv"):
with open(file, 'r') as csvfile:
csv_list = []
for line in csvfile:
csv_list.append(line.strip("\t"))
I am fairly new to python, and I know a relatively simply answer must be available, but I cannot seem to find it. Sorry if the answer is already on stack overflow, I tried for hours to find a similar/workable problem and this is the point I am really getting stuck.
Upvotes: 3
Views: 497
Reputation: 173
Based on the help and expertise of @jezrael (marked as answer), I put my final code here, where I also merged the dictionaries in the format that I needed and found a solution to the column that do not exist in every file.
Please let me know if this is not the way to go on stack overflow.
csv_dict = {}
for file in glob.glob(directoryname + "/*.csv"):
df = pd.read_csv(file, sep="\t")
d = {'StartB': 'Start', 'EndB': 'End'}
if "Duplications" in df.keys():
d = df.set_index('Name').rename(columns=d)[['Start', 'End', 'Size', 'Duplications']].to_dict(orient='index')
csv_dict.update(d)
else:
d = df.set_index('Name').rename(columns=d)[['Start', 'End', 'Size']].to_dict(orient='index')
csv_dict.update(d)
print(csv_dict)
result:
{'del1': {'Start': 969261, 'End': 969270, 'Size': 10},
'del2': {'Start': 641573, 'End': 641672, 'Size': 100},
'del3': {'Start': 998620, 'End': 999119, 'Size': 500},
'dup1': {'Start': 595662, 'End': 595671, 'Size': 10, 'Duplications': 3},
'dup2': {'Start': 321225, 'End': 321324, 'Size': 100, 'Duplications': 3},
'dup3': {'Start': 971634, 'End': 972133, 'Size': 500, 'Duplications': 10},
'inv1': {'Start': 818450, 'End': 818459, 'Size': 10},
'inv2': {'Start': 991098, 'End': 991197, 'Size': 100},
'inv3': {'Start': 219635, 'End': 220134, 'Size': 500}}
Upvotes: 0
Reputation: 862611
I think need create index by column Name
with DataFrame.set_index
and then call DataFrame.to_dict
with parameter orient='index'
:
df = pd.read_csv(file, sep="\t")
d = df.set_index('Name').to_dict(orient='index')
print (d)
{'del1': {'Start': 100, 'End': 105, 'Size': 5},
'del2': {'Start': 150, 'End': 160, 'Size': 10},
'del3': {'Start': 250, 'End': 300, 'Size': 50}}
EDIT - You can rename columns names by dictionary and select by list columns for export to dict:
d = {'StartB':'Start','EndB':'End'}
d = df.set_index('Name').rename(columns=d)[['Start','End','Size']].to_dict(orient='index')
print (d)
{'inv1': {'Start': 400, 'End': 405, 'Size': 5},
'inv2': {'Start': 450, 'End': 460, 'Size': 10},
'inv3': {'Start': 450, 'End': 400, 'Size': 50}}
Upvotes: 1