Reputation: 11765
I have this code using Pandas in Python:
all_data = {}
for ticker in ['FIUIX', 'FSAIX', 'FSAVX', 'FSTMX']:
    all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2010', '1/1/2015')
prices = DataFrame({tic: data['Adj Close'] for tic, data in all_data.items()})
returns = prices.pct_change()
I know I can run a regression like this:
regs = sm.OLS(returns.FIUIX,returns.FSTMX).fit()
but how can I do this for each column in the dataframe? That is, how can I iterate over the columns and run the regression on each one?
Specifically, I want to regress each of the other ticker symbols (FIUIX, FSAIX and FSAVX) on FSTMX, and store the residuals from each regression.
I've tried various versions of the following, but nothing I've tried gives the desired result:
resids = {}
for k in returns.keys():
    reg = sm.OLS(returns[k],returns.FSTMX).fit()
    resids[k] = reg.resid
Is there something wrong with the returns[k] part of the code? How can I use the k value to access a column? Or else is there a simpler approach?
Upvotes: 320
Views: 924157
Reputation: 5563
You can use items():
for name, values in df.items():
    print('{name}: {value}'.format(name=name, value=values.iloc[0]))
For pandas < 2.0, you can use iteritems():
for name, values in df.iteritems():
    print('{name}: {value}'.format(name=name, value=values[0]))
Upvotes: 128
Reputation: 12801
You can index dataframe columns by position. This answer originally used ix, which was deprecated in pandas 0.20 and removed in pandas 1.0; iloc is the positional replacement.
The following returns the column at position 1 (the second column, since positions start at 0):
df1.iloc[:, 1]
The following returns the first row:
df1.iloc[0, :]
The following returns the value at the intersection of row 0 and column 1:
df1.iloc[0, 1]
and so on. So you can enumerate() returns.keys() and use the number to index the dataframe.
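As a minimal sketch of that last suggestion, the loop below pairs each column name with its position and uses iloc to fetch the column. The small returns frame is made-up data standing in for the one in the question.

```python
import pandas as pd

# Made-up stand-in for the question's `returns` DataFrame.
returns = pd.DataFrame({
    "FIUIX": [0.01, 0.02, -0.01],
    "FSAIX": [0.00, 0.01, 0.03],
    "FSTMX": [0.02, -0.01, 0.01],
})

columns_by_position = {}
for pos, name in enumerate(returns.keys()):
    # iloc[:, pos] selects every row of the column at integer position `pos`.
    columns_by_position[pos] = returns.iloc[:, pos]
    print(pos, name, columns_by_position[pos].tolist())
```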
Upvotes: 24
Reputation: 91
If you care about performance, I have benchmarked some ways to iterate over columns.
If you just want the column names, the fastest method is to iterate over df.columns.values: 51% faster than df.columns, 86% faster than df, and a whopping 2500% faster than df.items().
Details are as below:
# DataFrame with 1000 rows and 26 columns (from 'a' to 'z')
df = pd.DataFrame(
    np.random.randn(1000, 26),
    columns=list('abcdefghijklmnopqrstuvwxyz')
)
# Method 1
for col_name, col in df.items():
    ...
98.5 μs ± 1.17 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
# Method 2
for col in df:
    ...
6.9 μs ± 35.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
# Method 3
for col in df.columns:
    ...
5.6 μs ± 55.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
# Method 4 (fastest)
for col in df.columns.values:
    ...
3.7 μs ± 38.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
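To check these numbers on your own machine outside IPython, something like the following sketch with the standard timeit module should work. The consume helper and the loop count of 1000 are my own choices for illustration, and absolute timings will differ by pandas version and hardware.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.randn(1000, 26),
    columns=list("abcdefghijklmnopqrstuvwxyz"),
)

def consume(iterable):
    """Exhaust an iterable without doing any work per item."""
    for _ in iterable:
        pass

# Each candidate iterates over the same 26 columns in a different way.
candidates = {
    "df.items()": lambda: consume(df.items()),
    "df": lambda: consume(df),
    "df.columns": lambda: consume(df.columns),
    "df.columns.values": lambda: consume(df.columns.values),
}

n_loops = 1000  # arbitrary; raise it for steadier numbers
timings = {
    label: timeit.timeit(fn, number=n_loops) / n_loops
    for label, fn in candidates.items()
}

# Print fastest first, in microseconds per loop.
for label, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{label:>18}: {seconds * 1e6:8.2f} us per loop")
```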
Upvotes: 1
Reputation: 31998
Old answer:
for column in df:
    print(df[column])
The previous answer still works, but was added around the time of pandas 0.16.0. Better versions are available.
Now you can do:
for series_name, series in df.items():
    print(series_name)
    print(series)
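Applied to the regression in the question, the same loop could look roughly like the sketch below. It uses randomly generated returns instead of Yahoo data, and it computes the no-intercept least-squares slope directly with pandas rather than calling statsmodels, so it runs without extra dependencies. That matches sm.OLS(y, x) only in the sense that neither adds a constant term; swap the two marked lines for sm.OLS(y, x).fit().resid if you have statsmodels available.

```python
import numpy as np
import pandas as pd

# Synthetic daily returns standing in for the question's data (assumption).
rng = np.random.default_rng(0)
market = rng.normal(0, 0.01, size=250)
returns = pd.DataFrame({
    "FIUIX": 0.5 * market + rng.normal(0, 0.005, size=250),
    "FSAIX": 1.2 * market + rng.normal(0, 0.005, size=250),
    "FSAVX": 0.9 * market + rng.normal(0, 0.005, size=250),
    "FSTMX": market,
})

x = returns["FSTMX"]
resids = {}
# Drop the regressor so only the other tickers are iterated.
for name, y in returns.drop(columns="FSTMX").items():
    beta = (x * y).sum() / (x * x).sum()  # slope of y on x, no intercept
    resids[name] = y - beta * x           # residuals of that fit

resids = pd.DataFrame(resids)
print(resids.head())
```

Collecting the residuals into a DataFrame at the end keeps them aligned on the original index, one column per ticker.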
Upvotes: 584
Reputation: 5212
Most of these answers are going via the column name, rather than iterating the columns directly. They will also have issues if there are multiple columns with the same name. If you want to iterate the columns, I'd suggest:
for series in (df.iloc[:,i] for i in range(df.shape[1])):
    ...
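A small sketch of the duplicate-name problem, using a made-up frame with two columns called "a": selecting by name returns a DataFrame holding both of them, while selecting by position always returns a single Series.

```python
import pandas as pd

# Made-up frame with a duplicated column name.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "a", "b"])

# By name: df["a"] matches both "a" columns, so it is a DataFrame.
by_name = [type(df[name]).__name__ for name in df.columns]

# By position: each selection is exactly one column.
by_position = [type(df.iloc[:, i]).__name__ for i in range(df.shape[1])]

print(by_name)
print(by_position)
```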
Upvotes: 0
Reputation: 597
Assuming X holds the factors and y the labels (which may span multiple columns):
columns = [c for c in _df.columns if c in ['col1', 'col2','col3']] #or '..c not in..'
_df.set_index(columns, inplace=True)
print( _df.index)
X, y = _df.iloc[:,:4].values, _df.index.values
Upvotes: -1
Reputation: 26261
I landed on this question as I was looking for a clean iterator of columns only (Series, no names).
Unless I am mistaken, there is no such thing, which, if true, is a bit annoying. In particular, one would sometimes like to assign a few individual columns (Series) to variables, e.g.:
x, y = df[['x', 'y']]  # does not work: assigns the column names, not the Series
There is df.items() that gets close, but it gives an iterator of tuples (column_name, column_series). Interestingly, there is a corresponding df.keys() which returns df.columns, i.e. the column names as an Index, so a, b = df[['x', 'y']].keys() assigns properly a='x' and b='y'. But there is no corresponding df.values(), and for good reason, as df.values is a property and returns the underlying numpy array.
One (inelegant) way is to do:
x, y = (v for _, v in df[['x', 'y']].items())
but it's less pythonic than I'd like.
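For comparison, here is a short sketch on a made-up three-column frame. It shows what plain unpacking actually assigns, next to a generator over the column names, which is one more way to get the Series themselves.

```python
import pandas as pd

# Made-up frame for illustration.
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})

# Iterating a DataFrame yields column names, so this unpacks two strings.
a, b = df[["x", "y"]]

# Looking each name up explicitly unpacks the Series instead.
x, y = (df[name] for name in ["x", "y"])

print(a, b)
print(x.tolist(), y.tolist())
```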
Upvotes: 1
Reputation: 1958
This answer is to iterate over selected columns as well as all columns in a DF.
df.columns gives an Index containing all the column names in the DF. That isn't very helpful on its own if you want to iterate over all the columns, but it comes in handy when you want to iterate over only the columns of your choosing.
An Index supports the same slicing as a Python list, so we can slice df.columns according to our needs. For example, to iterate over all columns but the first one, we can do:
for column in df.columns[1:]:
    print(df[column])
Similarly, to iterate over all the columns in reversed order, we can do:
for column in df.columns[::-1]:
    print(df[column])
We can iterate over all the columns in a lot of cool ways using this technique. Also remember that you can get the indices of all columns easily using:
for ind, column in enumerate(df.columns):
    print(ind, column)
Upvotes: 66
Reputation: 2672
Based on the accepted answer, if an index corresponding to each column is also desired:
for i, column in enumerate(df):
    print(i, df[column])
The df[column] above is a Series, which can simply be converted into a numpy ndarray:
for i, column in enumerate(df):
    print(i, np.asarray(df[column]))
Upvotes: 9
Reputation: 1175
I'm a bit late but here's how I did this. The steps:
This is the code I used on a DataFrame called aft_tmt. Feel free to extrapolate to your use case.
import pandas as pd
# setting options to print without truncating output
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
import statsmodels.formula.api as smf
import itertools
# This section gets the column names of the DF and removes some columns which I don't want to use as predictors.
itercols = aft_tmt.columns.tolist()
itercols.remove("sc97")
itercols.remove("sc")
itercols.remove("grc")
itercols.remove("grc97")
print(itercols)
len(itercols)
# results DF
regression_res = pd.DataFrame(columns = ["Rsq", "predictors", "excluded"])
# excluded cols
exc = []
# change 9 to the number of columns you want to combine from N columns.
#Possibly run an outer loop from 0 to N/2?
for x in itertools.combinations(itercols, 9):
    lmstr = "+".join(x)
    m = smf.ols(formula = "sc ~ " + lmstr, data = aft_tmt)
    f = m.fit()
    # candidate predictors left out of this combination
    exc = [item for item in itercols if item not in x]
    row = pd.DataFrame([[f.rsquared, lmstr, "+".join(exc)]], columns = ["Rsq", "predictors", "excluded"])
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    regression_res = pd.concat([regression_res, row], ignore_index=True)
regression_res.sort_values(by="Rsq", ascending = False)
Upvotes: 4
Reputation: 2324
Using list comprehension, you can get all the columns names (header):
[column for column in df]
Upvotes: 11
Reputation: 6711
A workaround is to transpose the DataFrame and iterate over the rows.
for column_name, column in df.transpose().iterrows():
    print(column_name)
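One caveat with this workaround is that transposing a frame whose columns have different dtypes forces everything to object. The sketch below, on a made-up two-column frame, compares the dtype you get from the transpose with the one you get from items().

```python
import pandas as pd

# Made-up frame with one integer column and one string column.
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# Column "a" as seen through the transpose-and-iterrows workaround.
via_transpose = {name: col for name, col in df.transpose().iterrows()}

# Column "a" as seen through items(), which preserves the column dtype.
via_items = dict(df.items())

print(via_transpose["a"].dtype)
print(via_items["a"].dtype)
```

On a frame where every column shares one numeric dtype the two approaches agree, so this only matters for mixed data.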
Upvotes: 18