Reputation: 37
I created a panda dataframe and am trying to split one of the columns (dtype = object) into multiple columns by separator.
I tried:
new = df["contract"].str.split(",", n=100, expand=True)
df[['Test2', 'conID', 'Test1', 'Expiration', 'Strike', 'Type', 'Multiplier', 'Exchange', 'Currency', 'Code', 'tradingClass']] = pd.DataFrame(df['contract'].tolist(), index=df.index)
df['contract_new'] = df['contract'].str.split(',')
df['contract_new'] = df['contract'].astype('str')
df['contract_new'] = df['contract'].str.replace('(', ',')
Below I have copied in the first three rows of the content of the panda df column with header 'contract'. It is a long field with 10 important data points which I need to be provided in different columns. The dataframe is retrieved from Interactive Brokers API.
contract
0 Option(conId=357974235, symbol='SPX', lastTradeDateOrContractMonth='20190718', strike=2980.0, right='P', multiplier='100', exchange='SMART', currency='USD', localSymbol='SPX 190719P02980000', tradingClass='SPX')
1 Option(conId=357974238, symbol='SPX', lastTradeDateOrContractMonth='20190718', strike=2985.0, right='P', multiplier='100', exchange='SMART', currency='USD', localSymbol='SPX 190719P02985000', tradingClass='SPX')
2 Option(conId=357974242, symbol='SPX', lastTradeDateOrContractMonth='20190718', strike=2990.0, right='P', multiplier='100', exchange='SMART', currency='USD', localSymbol='SPX 190719P02990000', tradingClass='SPX')
I would like to split the column with the 10 strings which are separated by the comma's into 10 separate columns, or perform other string based actions. In the end I want to see the following items in different columns:
-SPX
-201907192980.0
-P
-100
-SMART
-USD
-190719P02980000 (THIS IS THE MOST IMPORTANT PART I NEED)
-SPX
So far nothing works.
Upvotes: 2
Views: 657
Reputation: 4521
I just updated my answer. I try to use pandas operations since it seems you already have your data in a pandas DataFrame. The method should be reasonably robust regarding missing key-value pairs and their order in the input string:
import re
re_opt_start= re.compile('Option\(')
re_opt_end= re.compile('\)\s*')
re_split= re.compile('\s*,\s*')
df['contract']= df['contract'].str.replace(re_opt_start, '')
df['contract']= df['contract'].str.replace(re_opt_end, '')
df_split= df['contract'].str.split(',', expand=True)
result_df= None
for column in df_split:
col_df= df_split[column].str.strip().str.split('=', expand=True)
col_df.columns= ['col', 'value']
col_df['value']= col_df['value'].str.strip("'")
col_df.set_index('col', append=True, inplace=True)
if result_df is None:
result_df= col_df
else:
result_df= pd.concat([result_df, col_df], axis='index')
unstacked_df=result_df.unstack(level=-1).droplevel(0, axis='columns')
unstacked_df.loc[unstacked_df['localSymbol'].str[:3] == 'SPX', 'localSymbol']= unstacked_df['localSymbol'].str[3:]
unstacked_df
This returns:
Out[1285]:
col conId currency exchange lastTradeDateOrContractMonth ... right strike symbol tradingClass
0 357974235 USD SMART 20190718 ... P 2980.0 SPX SPX
1 357974238 USD SMART 20190718 ... P 2985.0 SPX SPX
2 357974242 USD SMART 20190718 ... P 2990.0 SPX SPX
[3 rows x 10 columns]
Upvotes: 2
Reputation: 24930
This can be done without regex by performing the following string manipulation (or something similar) on the data in each row. Using your first row as an example:
option = "conId=357974235, symbol='SPX', lastTradeDateOrContractMonth='20190718', strike=2980.0, right='P', multiplier='100', exchange='SMART', currency='USD', localSymbol='SPX 190719P02980000', tradingClass='SPX'"
data = option.split(',')
to_delete = 0,3 #since apparently you aren't interested in 'conId' and 'strike'
for i in sorted(to_delete, reverse=True):
del data[i]
for datum in data:
if "localSymbol" in datum:
datum = datum.replace('SPX ','')
print(datum.split('=')[1])
Output:
'SPX'
'20190718'
'P'
'100'
'SMART'
'USD'
'190719P02980000'
'SPX'
To automate the process, let's assume the data is stored like this:
option1 = "conId=357974235, symbol='SPX', lastTradeDateOrContractMonth='20190718', strike=2980.0, right='P', multiplier='100', exchange='SMART', currency='USD', localSymbol='SPX 190719P02980000', tradingClass='SPX'"
option2 = "conId=357974238, symbol='SPX', lastTradeDateOrContractMonth='20190718', strike=2985.0, right='P', multiplier='100', exchange='SMART', currency='USD', localSymbol='SPX 190719P02985000', tradingClass='SPX'"
etc. The code above is then modified as follows:
options = [option1, option2] #etc.
option_data = [] #this is a list of lists which will host all relevant data
to_delete = 0,3
for option in options:
data = option.split(',')
for i in sorted(to_delete, reverse=True):
del data[i]
current_datum = [] #this is a one time list that will store data for the current item
for datum in data:
if "localSymbol" in datum:
datum = datum.replace('SPX ','')
current_datum.append(datum.split('=')[1])
option_data.append(current_datum)
Finally, create the dataframe:
columns = ['symbol','last trade','right','multiplier','exchange','currency','local symbol','trading class']
df = pd.DataFrame(option_data, columns =columns)
df
Output:
symbol last trade right multiplier exchange currency local symbol trading class
0 'SPX' '20190718' 'P' '100' 'SMART' 'USD' '190719P02980000' 'SPX'
1 'SPX' '20190718' 'P' '100' 'SMART' 'USD' '190719P02985000' 'SPX'
Upvotes: 1
Reputation: 106
I guess you are looking for something like this:
def contract_to_columns(c):
return pd.Series({"conId": c.conId, "symbol": c.symbol, "multiplier": c.multiplier,
"lastTradeDateOrContractMonth": c.lastTradeDateOrContractMonth,
"strike": c.strike, "right": c.right, "exchange": c.exchange,
"currency": c.currency, "localSymbol": c.localSymbol.split()[1],
"tradingClass": c.tradingClass})
df['contract'].apply(contract_to_columns)
your contract
column is an object, what you need is to do a mapping from contract object to multiple columns, notice that the column localSymbol
has a prefix (SPX) you don't need, I removed it.
This code depends also on the definition of class Option
, if you want more help, please share the code of Option
class.
Cheers
Upvotes: 1