Ivan Simsic-Babic
Ivan Simsic-Babic

Reputation: 17

How to select columns based on value they contain pandas

I am working in pandas with a certain dataset that describes the population of a certain country per year. The dataset is construed in a weird way wherein the years aren't the columns themselves but rather the years are a value within the first row of the set. The dataset describes every year from 1960 up til now but I only need 1970, 1980, 1990 etc. For this purpose I've created a list with all those years and tried to make a new dataset which is equivalent to the old one but only has the columns that contain a value from said list so I don't have all this extra info I'm not using. Online I can only find instructions for removing rows or selecting by column name, since both these criteria don't apply in this situation I thought i should ask here. The dataset is a csv file which I've downloaded off some world population site. here a link to a screenshot of the data

As you can see the years are given in scientific notation for some years, which is also how I've added them to my list.

pop = pd.read_csv('./maps/API_SP.POP.TOTL_DS2_en_csv_v2_10576638.csv', 

header=None, engine='python', skiprows=4)
display(pop)
years = ['1.970000e+03','1.980000e+03','1.990000e+03','2.000000e+03','2.010000e+03','2.015000e+03', 'Country Name']
pop[pop.columns[pop.isin(years).any()]]

This is one of the things I've tried so far which I thought made the most sense, but I am still very new to pandas so any help would be greatly appreciated.

Upvotes: 1

Views: 333

Answers (1)

Brendan
Brendan

Reputation: 4011

Using the data at https://data.worldbank.org/indicator/sp.pop.totl, copied into pastebin (first time using the service, so apologies if it doesn't work for some reason):

# actual code using CSV file saved to desktop
#df = pd.read_csv(<path to CSV>, skiprows=4)

# pastebin for reproducibility
df = pd.read_csv(r'https://pastebin.com/raw/LmdGySCf',sep='\t')

# manually select years and other columns of interest
colsX = ['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code',
         '1990', '1995', '2000'] 
dfX = df[colsX]

# select every fifth year
colsY = df.filter(regex='19|20', axis=1).columns[[int(col) % 5 == 0 for col in df.filter(regex='19|20', axis=1).columns]]
dfY = df[colsY]

As a general comment:

The dataset is construed in a weird way wherein the years aren't the columns themselves but rather the years are a value within the first row of the set.

This is not correct. Viewing the CSV file, it is quite clear that row 5 (Country Name, Country Code, Indicator Name, Indicator Code, 1960, 1961, ...) are indeed column names. You have read the data into pandas in such a way that those values are not column years, but your first step, before trying to subset your data, should be to ensure you have read in the data properly -- which, in this case, would give you column headers named for each year.

Upvotes: 1

Related Questions