Reputation:
I'm using a for to generate a excel file to graph the data from a df so I'm using value_counts but I would like to add under this df a second one with the same data but with percentages so my code is this one:
li = []
for i in range(0, len(df.columns)):
value_counts = df.iloc[:, i].value_counts().to_frame().reset_index()
value_percentage = df.iloc[:, i].value_counts(normalize=True).to_frame().reset_index()#.drop(columns='index')
value_percentage = (value_percentage*100).astype(str)+'%'
li.append(value_counts)
li.append(value_percentage)
data = pd.concat(li, axis=1)
data.to_excel("resultdf.xlsx") #index cleaned
Basically I need it to look like this:
Upvotes: 2
Views: 214
Reputation: 7230
Let's prepare some dummy data to work with. Based on the provided screenshot, I'm assuming that the raw data are sort of music genres grade on a scale of 1 to 5. So I'm gonna use as data something like this:
import pandas as pd
from numpy.random import default_rng
rng = default_rng(0)
columns = ['Pop', 'Dance', 'Rock', 'Jazz']
data = rng.integers(1, 5, size=(100, len(columns)), endpoint=True)
df = pd.DataFrame(data, columns=columns)
for column in df.columns: df[column] ...
map('.0%'.format)
before transforming them to frame.pd.concat
them vertically into a single frame and append it to the list.So the original code may be rewritten like this:
li = []
for col in df.columns:
value_counts = df[col].value_counts()
value_percentage = df[col].value_counts(normalize=True).map('{:.0%}'.format)
li.append(pd.concat([value_counts, value_percentage]).to_frame().reset_index())
resultdf = pd.concat(li, axis=1)
resultdf.to_excel("resultdf.xlsx")
What if we let Excel format the data as percentages on its own? I think that the easiest way to do this is to use Styler
. But before that, I suggest to get rid of Index
columns. As I can see, all of them refer to the same grades 1,2,3,4,5
. So we can use them as the common index thus making indexes meaningful. Also I'm gonna use MultiIndex
to separate counted and normalized values like this:
formula = ['counts', 'percent']
values = [1, 2, 3, 4, 5]
counted = pd.DataFrame(index=pd.MultiIndex.from_product([formula, values], names=['formula', 'values']))
counted
is our data container and it's empty at the moment. Let's fill it in:
for col in df.columns:
counts = df[col].value_counts()
percent = counts / counts.sum()
counted[col] = pd.concat([counts, percent], keys=formula)
Having these data, let's apply some style to them and only then transform into an Excel file:
styled_data = (
counted.style
.set_properties(**{'number-format': '0'}, subset=pd.IndexSlice['counts', columns])
.set_properties(**{'number-format': '0%'}, subset=pd.IndexSlice['percent', columns])
)
styled_data.to_excel('test.xlsx')
Now our data in Excel are looking like this:
All of them are numbers and we can use them in further calculations.
from pandas import DataFrame, MultiIndex, IndexSlice, concat
from numpy.random import default_rng
# Initial parameters
rng = default_rng(0)
data_length = 100
genres = ['Pop', 'Dance', 'Rock', 'Jazz']
values = [1, 2, 3, 4, 5]
formula = ['counts', 'percent']
file_name = 'test.xlsx'
# Prepare data
data = rng.integers(min(values), max(values), size=(data_length, len(genres)), endpoint=True)
df = DataFrame(data, columns=genres)
# Prepare a container for counted data
index = MultiIndex.from_product([formula, values], names=['formula', 'values'])
counted = DataFrame(index=index)
# Fill in counted data
for col in df.columns:
counts = df[col].value_counts()
percent = counts / counts.sum()
counted[col] = concat([counts, percent], keys=formula)
# Apply number formatting and save the data in a Excel file
styled_data = (
counted.style
.set_properties(**{'number-format': '0'}, subset=IndexSlice['counts', :])
.set_properties(**{'number-format': '0%'}, subset=IndexSlice['percent', :])
)
styled_data.to_excel(file_name)
Note not to get confused. In case of the used dummy data we can see identical values in counts
and percent
parts. That's because of how data were built. I used 100 total number of values in the initial data frame df
. So the number of value_counts
and their percentage are equal.
python 3.11.0
pandas 1.5.1
numpy 1.23.4
If we wanna keep values for each column of the original data, but use Styler
to set a number format for a second half of the output frame, then we should somehow rename Index
columns, because Styler
requires unique column/index labels in a passed DataFrame
. We can ether rename them somehow (e.g. "Values.Pop", etc.) or we can use a multi indexing for columns, which IMO looks better. Also let's take into account that number of unique values may differ for different columns. Which means that we have to collect data separately for couts and percent values before connecting them:
import pandas as pd
from numpy.random import default_rng
# Prepare dummy data with missing values in some columns
rng = default_rng(0)
columns = ['Pop', 'Dance', 'Rock', 'Jazz']
data = rng.integers(1, 5, size=(100, len(columns)), endpoint=True)
df = pd.DataFrame(data, columns=columns)
df['Pop'].replace([1,5], 2, inplace=True)
df['Dance'].replace(3, 5, inplace=True)
# Collect counted values and their percentage
counts, percent = [], []
for col in df.columns:
item = (
df[col].value_counts()
.rename('count')
.rename_axis('value')
.to_frame()
.reset_index()
)
counts.append(item)
percent.append(item.assign(count=item['count']/item['count'].sum()))
# Combine counts and percent in a single data frame
counts = pd.concat(counts, axis=1, keys=df.columns)
percent = pd.concat(percent, axis=1, keys=df.columns)
resultdf = pd.concat([counts, percent], ignore_index=True)
# Note: In order to use resultdf in styling we should produce
# unique index labels for the output data.
# For this purpose we can use ignore_index=True
# or assign some keys for each part, e.g. key=['counted', 'percent']
# Format the second half of resultdf as Percent, ie. "0%" in Excel terminology
styled_result = (
resultdf.style
.set_properties(
**{'number-format': '0%'},
subset=pd.IndexSlice[len(resultdf)/2:, pd.IndexSlice[:,'count']])
# if we used keys instead of ignore_index to produce resultdf
# then len(resultdf)/2: should be replaced with 'percent'
# i.e. the name of the percent part.
)
styled_result.to_excel('my_new_excel.xlsx')
The output in this case is gonna look like this:
Upvotes: 1
Reputation: 397
As long as the column names match between the two data frames you should be able to use pd.concat()
to concatenate the two data frames. To concatenate them vertically, I think you should use axis=0
instead of axis=1
see docs
Upvotes: 3