user16239103
user16239103

Reputation:

Add df under other df Pandas

I'm using a for to generate a excel file to graph the data from a df so I'm using value_counts but I would like to add under this df a second one with the same data but with percentages so my code is this one:

li = []

for i in range(0, len(df.columns)):
    value_counts = df.iloc[:, i].value_counts().to_frame().reset_index()
    value_percentage = df.iloc[:, i].value_counts(normalize=True).to_frame().reset_index()#.drop(columns='index')
    value_percentage = (value_percentage*100).astype(str)+'%'
    li.append(value_counts)
    li.append(value_percentage)
data = pd.concat(li, axis=1)
data.to_excel("resultdf.xlsx") #index cleaned

Basically I need it to look like this:

enter image description here

Upvotes: 2

Views: 214

Answers (2)

Vitalizzare
Vitalizzare

Reputation: 7230

Data

Let's prepare some dummy data to work with. Based on the provided screenshot, I'm assuming that the raw data are sort of music genres grade on a scale of 1 to 5. So I'm gonna use as data something like this:

import pandas as pd
from numpy.random import default_rng

rng = default_rng(0)
columns = ['Pop', 'Dance', 'Rock', 'Jazz']
data = rng.integers(1, 5, size=(100, len(columns)), endpoint=True)

df = pd.DataFrame(data, columns=columns)

Notes on the original code

  1. There's no need to iterate by a column index. We can iterate through column names, as in for column in df.columns: df[column] ...
  2. I think it's better to format data with help of map('.0%'.format) before transforming them to frame.
  3. Instead of appending counted and normalized values one by one we better pd.concat them vertically into a single frame and append it to the list.

So the original code may be rewritten like this:

li = []

for col in df.columns:
    value_counts = df[col].value_counts()
    value_percentage = df[col].value_counts(normalize=True).map('{:.0%}'.format)
    li.append(pd.concat([value_counts, value_percentage]).to_frame().reset_index())

resultdf = pd.concat(li, axis=1)
resultdf.to_excel("resultdf.xlsx") 

Let Excel do formatting

What if we let Excel format the data as percentages on its own? I think that the easiest way to do this is to use Styler. But before that, I suggest to get rid of Index columns. As I can see, all of them refer to the same grades 1,2,3,4,5. So we can use them as the common index thus making indexes meaningful. Also I'm gonna use MultiIndex to separate counted and normalized values like this:

formula = ['counts', 'percent']
values = [1, 2, 3, 4, 5]
counted = pd.DataFrame(index=pd.MultiIndex.from_product([formula, values], names=['formula', 'values']))

counted is our data container and it's empty at the moment. Let's fill it in:

for col in df.columns:
    counts = df[col].value_counts()
    percent = counts / counts.sum()
    counted[col] = pd.concat([counts, percent], keys=formula)

Having these data, let's apply some style to them and only then transform into an Excel file:

styled_data = (
    counted.style
    .set_properties(**{'number-format': '0'}, subset=pd.IndexSlice['counts', columns])
    .set_properties(**{'number-format': '0%'}, subset=pd.IndexSlice['percent', columns])
)
styled_data.to_excel('test.xlsx')

Now our data in Excel are looking like this:

excel data

All of them are numbers and we can use them in further calculations.

Full code

from pandas import DataFrame, MultiIndex, IndexSlice, concat
from numpy.random import default_rng

# Initial parameters
rng = default_rng(0)
data_length = 100
genres = ['Pop', 'Dance', 'Rock', 'Jazz']
values = [1, 2, 3, 4, 5]
formula = ['counts', 'percent']
file_name = 'test.xlsx'

# Prepare data
data = rng.integers(min(values), max(values), size=(data_length, len(genres)), endpoint=True)
df = DataFrame(data, columns=genres)

# Prepare a container for counted data 
index = MultiIndex.from_product([formula, values], names=['formula', 'values'])
counted = DataFrame(index=index)

# Fill in counted data
for col in df.columns:
    counts = df[col].value_counts()
    percent = counts / counts.sum()
    counted[col] = concat([counts, percent], keys=formula)

# Apply number formatting and save the data in a Excel file
styled_data = (
    counted.style
    .set_properties(**{'number-format': '0'}, subset=IndexSlice['counts', :])
    .set_properties(**{'number-format': '0%'}, subset=IndexSlice['percent', :])
)
styled_data.to_excel(file_name)

P.S.

Note not to get confused. In case of the used dummy data we can see identical values in counts and percent parts. That's because of how data were built. I used 100 total number of values in the initial data frame df. So the number of value_counts and their percentage are equal.


python 3.11.0
pandas 1.5.1
numpy 1.23.4

Update

If we wanna keep values for each column of the original data, but use Styler to set a number format for a second half of the output frame, then we should somehow rename Index columns, because Styler requires unique column/index labels in a passed DataFrame. We can ether rename them somehow (e.g. "Values.Pop", etc.) or we can use a multi indexing for columns, which IMO looks better. Also let's take into account that number of unique values may differ for different columns. Which means that we have to collect data separately for couts and percent values before connecting them:

import pandas as pd
from numpy.random import default_rng

# Prepare dummy data with missing values in some columns
rng = default_rng(0)
columns = ['Pop', 'Dance', 'Rock', 'Jazz']
data = rng.integers(1, 5, size=(100, len(columns)), endpoint=True)

df = pd.DataFrame(data, columns=columns)
df['Pop'].replace([1,5], 2, inplace=True)
df['Dance'].replace(3, 5, inplace=True)

# Collect counted values and their percentage
counts, percent = [], []

for col in df.columns:
    item = (
        df[col].value_counts()
        .rename('count')
        .rename_axis('value')
        .to_frame()
        .reset_index()
    )
    counts.append(item)
    percent.append(item.assign(count=item['count']/item['count'].sum()))

# Combine counts and percent in a single data frame
counts = pd.concat(counts, axis=1, keys=df.columns)
percent = pd.concat(percent, axis=1, keys=df.columns)
resultdf = pd.concat([counts, percent], ignore_index=True)
    # Note: In order to use resultdf in styling we should produce
    # unique index labels for the output data. 
    # For this purpose we can use ignore_index=True
    # or assign some keys for each part, e.g. key=['counted', 'percent']

# Format the second half of resultdf as Percent, ie. "0%" in Excel terminology
styled_result = (
    resultdf.style
    .set_properties(
        **{'number-format': '0%'}, 
        subset=pd.IndexSlice[len(resultdf)/2:, pd.IndexSlice[:,'count']])
        # if we used keys instead of ignore_index to produce resultdf
        # then len(resultdf)/2: should be replaced with 'percent'
        # i.e. the name of the percent part.
)
styled_result.to_excel('my_new_excel.xlsx')

The output in this case is gonna look like this:

output with missing values

Upvotes: 1

Moritz Wilksch
Moritz Wilksch

Reputation: 397

As long as the column names match between the two data frames you should be able to use pd.concat() to concatenate the two data frames. To concatenate them vertically, I think you should use axis=0 instead of axis=1 see docs

Upvotes: 3

Related Questions