Reputation: 199
Currently I am reading in from a file and it is generating this file (output.txt
):
Atom nVa avgppm stddev delta
1.H1' 2 5.73649 0.00104651803616 1.0952e-06
1.H2' 1 4.85438
1.H8 1 8.05367
10.H1' 3 5.33823 0.136655138213 0.0186746268
10.H2' 1 4.20449
10.H5 3 5.27571333333 0.231624986634 0.0536501344333
10.H6 5 7.49485 0.0285124165935 0.0008129579
This is the code that reads generates this file (I am reading in from a text file to generate these values)
df = pd.read_csv(expAtoms, sep = ' ', header = None)
df.columns = ["Atom","ppm"]
gb = (df.groupby("Atom", as_index=False).agg({"ppm":["count","mean","std","var"]}).rename(columns={"count":"nVa", "mean":"avgppm","std":"stddev","var":"delta"}))
gb.head()
gb.columns = gb.columns.droplevel()
gb = gb.rename(columns={"":"Atom"})
gb.to_csv("output.txt", sep =" ", index=False)
In between my nVa
column and my avgppm
column, I want to insert another column called predppm
. I want to get the values from a file called file.txt
which looks like this:
5.H6 7.72158 0.3
6.H6 7.70272 0.3
7.H8 8.16859 0.3
1.H1' 7.65014 0.3
9.H8 8.1053 0.3
10.H6 7.5231 0.3
How can I check if the values in the first column of file.txt
= the values of the first column in output.txt
and if it does, insert the value from the second column of file.txt
into a column in my output file between the nVa column and the avgppm column?
For example, 1.H1'
is in output.txt and file.txt, so i would like to create a column called predppm
in my output.txt file and have the value 7.65014
(which comes from second column of file.txt) inserted for the 1.H1'
atom.
I think I understand how to add columns but only for functions that I can use with groupby, but I don't know how I can insert an arbitrary column into my output.
Upvotes: 2
Views: 63
Reputation: 476
The easiest way is to make an index
on the pandas.DataFrame
. Pandas has nice logic for matching up indexes.
from io import StringIO
import pandas as pd
# if python2, do:
# data = u"""\
data = """\
Atom nVa avgppm stddev delta
1.H1' 2 5.73649 0.00104651803616 1.0952e-06
1.H2' 1 4.85438
1.H8 1 8.05367
10.H1' 3 5.33823 0.136655138213 0.0186746268
10.H2' 1 4.20449
10.H5 3 5.27571333333 0.231624986634 0.0536501344333
10.H6 5 7.49485 0.0285124165935 0.0008129579
"""
# if python2, do:
# other_data = u"""\
other_data = """\
5.H6 7.72158 0.3
6.H6 7.70272 0.3
7.H8 8.16859 0.3
1.H1' 7.65014 0.3
9.H8 8.1053 0.3
10.H6 7.5231 0.3
"""
# setup these strings so they can be read by pd.read_csv
# (not necessary if these are actual files on disk)
data_file = StringIO(data)
other_data_file = StringIO(other_data)
# don't say header=None because the first row has the column names
df = pd.read_csv(data_file, sep=' ')
# set the index to 'Atom'
df = df.set_index('Atom')
# header=None because the other_data doesn't have header info
other_df = pd.read_csv(other_data_file, sep=' ', header=None)
# set the column names since they're not specified in other_data
other_df.columns = ['Atom', 'predppm', 'some_other_field']
# set the index to 'Atom'
other_df = other_df.set_index('Atom')
# this will assign other_df['predppm'] to the correct rows,
# because pandas uses the index when assigning new columns
df['predppm'] = other_df['predppm']
print(df)
# nVa avgppm stddev delta predppm
# Atom
# 1.H1' 2 5.736490 0.001047 0.000001 7.65014
# 1.H2' 1 4.854380 NaN NaN NaN
# 1.H8 1 8.053670 NaN NaN NaN
# 10.H1' 3 5.338230 0.136655 0.018675 NaN
# 10.H2' 1 4.204490 NaN NaN NaN
# 10.H5 3 5.275713 0.231625 0.053650 NaN
# 10.H6 5 7.494850 0.028512 0.000813 7.52310
# if you want to return 'Atom' to being a column:
df = df.reset_index()
Upvotes: 1