Jiří Pešík

Reputation: 334

Create a new DataFrame from selected data from another DataFrame

I want to create box plots using pandas. I have data with average temperatures, and I want to select three cities and draw one box plot per city to compare their temperatures. To do this, I created a result DataFrame to hold the data, with each city's values stored in its own column (one column per city).

However, the following code only shows a plot for the first city. The problem is with the DataFrame: each query on its own correctly returns a Series of values, but when I insert it into the result DataFrame, a column of NaN values is stored instead. What am I missing here?


import pandas
import matplotlib.pyplot as plt
import wget
wget.download("https://raw.githubusercontent.com/pesikj/python-012021/master/zadani/5/temperature.csv")

temperatures = pandas.read_csv("temperature.csv")
helsinki = temperatures[temperatures["City"] == "Helsinki"]["AvgTemperature"]
miami = temperatures[temperatures["City"] == "Miami Beach"]["AvgTemperature"]
tokyo = temperatures[temperatures["City"] == "Tokyo"]["AvgTemperature"]
result = pandas.DataFrame()
result["Helsinki"] = helsinki
result["Miami Beach"] = miami
result["Tokyo"] = tokyo
result.plot(kind="box",whis=[0,100])
plt.show()

Upvotes: 1

Views: 299

Answers (2)

jfaccioni

Reputation: 7509

Since you're using data science packages, consider using seaborn, which groups the data for you whenever you call one of its plot functions, so you only need to filter for the cities you want:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Load dataset
url = "https://raw.githubusercontent.com/pesikj/python-012021/master/zadani/5/temperature.csv"
temperatures = pd.read_csv(url)

# Filter for cities of interest
cities = ['Helsinki', 'Miami Beach', 'Tokyo']
filtered_temperatures = temperatures.loc[temperatures['City'].isin(cities)]

# Let seaborn do the grouping
sns.violinplot(data=filtered_temperatures, x='City', y='AvgTemperature')
plt.show()


Upvotes: 1

tdy

Reputation: 41327

Pivot into City columns using pivot_table() and select the 3 cities you want:

result = temperatures.pivot_table(
    index='Day',
    columns='City',
    values='AvgTemperature',
)[['Helsinki', 'Miami Beach', 'Tokyo']]

# City  Helsinki  Miami Beach  Tokyo
# Day                               
# 1         29.6         74.6   59.1
# 2         29.5         76.8   62.3
# ...
# 29        35.3         77.7   58.4
# 30        35.7         78.0   51.5
result.plot(kind='box', whis=[0,100])

[Image: temperature box plot]
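
For context on why the original code produced NaN columns: assigning a Series to a DataFrame column aligns on index labels, and each filtered Series keeps its row labels from temperatures, so the second and third cities share no labels with the first. A minimal sketch with toy data (the values and the names misaligned/aligned are made up for illustration, not taken from temperature.csv), assuming you only need the values side by side:

```python
import pandas as pd

# Toy data (hypothetical values, not from temperature.csv)
temperatures = pd.DataFrame({
    "City": ["Helsinki", "Helsinki", "Tokyo", "Tokyo"],
    "Day": [1, 2, 1, 2],
    "AvgTemperature": [29.6, 29.5, 59.1, 62.3],
})

helsinki = temperatures.loc[temperatures["City"] == "Helsinki", "AvgTemperature"]
tokyo = temperatures.loc[temperatures["City"] == "Tokyo", "AvgTemperature"]

# helsinki carries row labels 0 and 1, tokyo carries row labels 2 and 3
misaligned = pd.DataFrame()
misaligned["Helsinki"] = helsinki  # the empty frame adopts labels 0 and 1
misaligned["Tokyo"] = tokyo        # tokyo has no labels 0 or 1, so this is all NaN

# Dropping the original labels gives both Series a fresh 0..n-1 index
aligned = pd.DataFrame({
    "Helsinki": helsinki.reset_index(drop=True),
    "Tokyo": tokyo.reset_index(drop=True),
})
print(misaligned)
print(aligned)
```

Note that resetting the index pairs values by position rather than by Day, so it assumes every city has its rows in the same order; pivot_table() above avoids that assumption.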

Upvotes: 2
