Reputation: 1
My project is to collect compounds and small molecules related to a specific kinase enzyme from ZINC and other databases. I have tried several ways to download SMILES from ZINC IDs or PubChem IDs in the ZINC database, but none of them worked. Now I want to use a .smi file, but I don't know how to open it as a DataFrame.
My code is:
with open(smiles_file_path, 'r') as f:
    smiles_list = f.readlines()
# Create a DataFrame from the list of SMILES strings
df = pd.DataFrame({'SMILES': smiles_list})
# Display the DataFrame
print(df)
But it doesn't display the DataFrame properly.
Upvotes: 0
Views: 148
Reputation: 1869
.smi files are CSV-like text files with the SMILES string in the first column and an optional second column, separated by whitespace (spaces or a tab). ZINC uses the second column for the ID.
You can open the files directly with pandas.
import pandas as pd
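# \s+ is a regex matching one or more whitespace characters (raw string avoids escape warnings)
df = pd.read_csv('substances.smi', sep=r'\s+')
# read_csv treats the first line as a header by default; rename the columns
df.columns = ['SMILES', 'ID']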
print(df)
                                    SMILES                ID
0    N[C@H](CCc1ccc(N(CCCl)CCCl)cc1)C(=O)O  ZINC000016090786
1  N[C@@H](CCCc1ccc(N(CCCl)CCCl)cc1)C(=O)O  ZINC000002033385
2   N[C@H](CCCc1ccc(N(CCCl)CCCl)cc1)C(=O)O  ZINC000001763088
3    N[C@@H](Cc1ccc(N(CCCl)CCCl)cc1)C(=O)O  ZINC000000001673
4     N[C@H](Cc1ccc(N(CCCl)CCCl)cc1)C(=O)O  ZINC000000001661
5      CCN(CC)c1ccc(CC[C@@H](N)C(=O)O)cc1  ZINC001951410564
6       CCN(CC)c1ccc(CC[C@H](N)C(=O)O)cc1  ZINC001951410565
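One caveat: read_csv treats the first line of the file as a header by default, so if your .smi file starts directly with a molecule, that first molecule would be lost to the column names. A minimal sketch for that case, assuming a header-less file with SMILES and ID on each line (the two-line sample text is made up from the rows above, and io.StringIO only stands in for a real file path):

```python
import io

import pandas as pd

# Assumed sample content: no header line, SMILES then ID,
# separated by a space on one line and a tab on the other.
smi_text = (
    "N[C@H](Cc1ccc(N(CCCl)CCCl)cc1)C(=O)O ZINC000000001661\n"
    "CCN(CC)c1ccc(CC[C@H](N)C(=O)O)cc1\tZINC001951410565\n"
)

df = pd.read_csv(
    io.StringIO(smi_text),   # replace with your path, e.g. 'substances.smi'
    sep=r"\s+",              # one or more whitespace characters
    header=None,             # do not use the first molecule as column names
    names=["SMILES", "ID"],  # assign the column names directly
)
print(df)
```

If your file does have a header line (check the first line in a text editor), drop header=None and names and keep the renaming approach shown above instead.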
Upvotes: 0