antonio_zeus

Reputation: 497

pandas dataframe, setting index_col to my csv name

I have a question about using pd.read_csv. I am currently building a DataFrame from multiple CSV files in a folder, and the files are named as follows: "C2__1979H" or "C2_1999Z".

I would like to set the index of my DataFrame to the name of the CSV file each row was read from. I have not yet found a way to do that. Here is my current code

My DataFrame looks like this:

    Date     Open    High     Low   Close     Vol  OI  Roll
0   19780106  236.00  237.50  234.50  235.50    0   0     0
1   19780113  235.50  239.00  235.00  238.25    0   0     0
2   19780120  238.00  239.00  234.50  237.00    0   0     0
3   19780127  237.00  238.50  235.50  236.00    0   0     0

I want it to look like this:

            Date       Open    High     Low   Close    Vol  OI  Roll
C2__1979N   19780106  236.00  237.50  234.50  235.50    0   0     0
C2__1979N   19780113  235.50  239.00  235.00  238.25    0   0     0
C2__1979N   19780120  238.00  239.00  234.50  237.00    0   0     0
C2__1979Z   19780127  237.00  238.50  235.50  236.00    0   0     0 ##(assuming this is where the next csv file began)

Upvotes: 2

Views: 5905

Answers (2)

Romain

Reputation: 21888

Adding the file's base name as a column and then using that column as the index does the trick.

import os
import pandas as pd

df_temp = pd.DataFrame({'Close': [235.5, 238.25, 237.0, 236.0],
 'Date': [19780106, 19780113, 19780120, 19780127],
 'High': [237.5, 239.0, 239.0, 238.5],
 'Low': [234.5, 235.0, 234.5, 235.5],
 'OI': [0, 0, 0, 0],
 'Open': [236.0, 235.5, 238.0, 237.0],
 'Roll': [0, 0, 0, 0],
 'Vol': [0, 0, 0, 0]})

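# Collect the pieces in a list; DataFrame.append was removed in pandas 2.0
frames = []

# To simulate several df, take two rows of df_temp per file
x = 0
for file_ in ['the_path/C2__1979N.csv', 'other_path/C2__1979H.csv']:
    filename, file_extension = os.path.splitext(file_)
    chunk = df_temp.loc[x:x+1, :].copy()
    chunk['name'] = os.path.basename(filename)  # file name without path or extension
    frames.append(chunk)
    x += 2  # move past the two rows just used

df = pd.concat(frames)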

df.set_index('name', inplace=True)
df.index.name = None
print(df)

# Result
            Close      Date   High    Low  OI   Open  Roll  Vol
C2__1979N  235.50  19780106  237.5  234.5   0  236.0     0    0
C2__1979N  238.25  19780113  239.0  235.0   0  235.5     0    0
C2__1979H  237.00  19780120  239.0  234.5   0  238.0     0    0
C2__1979H  236.00  19780127  238.5  235.5   0  237.0     0    0

Applied to the original code:

# allFiles is the list of CSV paths built in the question's code
names = ['Date', 'Open', 'High', 'Low', 'Close', 'Vol', 'OI', 'Roll']
frames = []
for file_ in allFiles:
    df_temp = pd.read_csv(file_, index_col=None, names=names)
    df_temp['Roll'] = 0
    df_temp.iloc[-2, -1] = 1  # flag the second-to-last row of each file
    filename, file_extension = os.path.splitext(file_)
    df_temp['name'] = os.path.basename(filename)
    frames.append(df_temp)

# DataFrame.append was removed in pandas 2.0; concatenate once instead
df = pd.concat(frames, ignore_index=True)
df.set_index('name', inplace=True)
df.index.name = None
df = df[names]

df = df.drop_duplicates('Date')  # remove duplicate rows with same date
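
As a variation on the loop above, here is a minimal self-contained sketch that reads a folder of CSV files and lets pd.concat build the index from the file names. The helper name read_folder and the two throwaway demo files are assumptions made for illustration; the column names and sample rows are taken from the question.

```python
import glob
import os
import tempfile

import pandas as pd

names = ['Date', 'Open', 'High', 'Low', 'Close', 'Vol', 'OI', 'Roll']


def read_folder(folder):
    """Hypothetical helper: read every CSV in `folder`, indexing rows by file name."""
    frames = {}
    for path in sorted(glob.glob(os.path.join(folder, '*.csv'))):
        # 'some/dir/C2__1979H.csv' -> 'C2__1979H'
        stem = os.path.splitext(os.path.basename(path))[0]
        frames[stem] = pd.read_csv(path, header=None, names=names)
    # The dict keys become the outer index level; drop the inner row counter.
    return pd.concat(frames).droplevel(1)


# Demo with two temporary files holding rows from the question
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, 'C2__1979H.csv'), 'w') as f:
        f.write('19780106,236.00,237.50,234.50,235.50,0,0,0\n'
                '19780113,235.50,239.00,235.00,238.25,0,0,0\n')
    with open(os.path.join(folder, 'C2__1979N.csv'), 'w') as f:
        f.write('19780120,238.00,239.00,234.50,237.00,0,0,0\n')
    combined = read_folder(folder)

print(combined)
```

Passing a dict to pd.concat avoids adding and then removing a temporary 'name' column, at the cost of one droplevel call.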

Upvotes: 2

hellpanderr

Reputation: 5896

Have you tried the obvious one?

df_temp.index = [file_]*len(df_temp)
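
Note that file_ as written is the full path including the extension, so that is what would end up in the index. A small sketch of the same one-liner with the path and extension stripped first; the two-row df_temp and the path below are placeholders for illustration only:

```python
import os

import pandas as pd

# Placeholder data and path, standing in for one file read in the loop
df_temp = pd.DataFrame({'Date': [19780106, 19780113],
                        'Close': [235.50, 238.25]})
file_ = 'the_path/C2__1979N.csv'

# Keep only the bare file name: 'the_path/C2__1979N.csv' -> 'C2__1979N'
stem = os.path.splitext(os.path.basename(file_))[0]

# Repeat the name once per row and use it as the index
df_temp.index = [stem] * len(df_temp)
print(df_temp)
```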

Upvotes: 0
