VJ.BG

Reputation: 93

Python - same kind of data, same code, but different outcomes

I apologize in advance if I'm mistaken, but the two pieces of code and data shown below look identical to me, yet they produce different outcomes. My data looks like this:

csv file

1907-09-01,108,13.5,7.9,20.7
1907-09-02,108,16.2,7.9,22
1907-09-03,108,16.2,13.1,21.3
1907-10-04,108,16.5,11.2,22
1907-10-05,108,17.6,10.9,25.4
1907-10-06,108,13,11.2,21.3
1907-11-07,108,11.3,6.3,16.1
1907-11-08,108,8.9,3.9,14.9
1907-11-09,108,11.6,3.8,21.1
1907-11-10,108,14.2,6.4,24.1
1907-11-11,108,15.4,10.1,20.4
1907-12-12,108,13.9,11.1,17.4
1907-12-13,108,13.8,8.3,21.3
1907-12-14,108,13,6.1,20.6
1907-12-15,108,13.1,5.7,20.9

code

import csv

f = open('ta_20200826183704.csv', 'r')
data = csv.reader(f)
header = next(data)

for row in data:
    print(row)


['1907-10-01', '108', '13.5', '7.9', '20.7']
['1907-10-02', '108', '16.2', '7.9', '22']
['1907-10-03', '108', '16.2', '13.1', '21.3']
['1907-10-04', '108', '16.5', '11.2', '22']
......

The dataset has about 40,000 days of data points. From this dataset, I tried to take the values in the last column (the day's highest temperature) and put them into a list within a list according to the month (i.e., into the corresponding list within all_month = [[], [], [], [], [], [], [], [], [], [], [], []], so that all January data points go to the first nested list). Put simply, I tried to do pandas' groupby('month') manually.

When I run the code below:

import csv

f = open('ta_20200826183704.csv', 'r')
data = csv.reader(f)
header = next(data)

all_month = []
month = []

for i in range(1,13):
    all_month.append(month)
    
for row in data:
    month = int(row[0].split('-')[1])
    for i in range(1,13):
        if month == i:
            all_month[i-1].append(row[-1])

The output has the same data in every nested list, meaning that the data points in the last column were not grouped by month (i.e., all of those points were put into every nested list).

What really puzzles me is that when I entered a small subset of the same data manually, I got the intended results:

test_list = [[],[],[],[],[],[],[],[],[],[],[],[]]
test_data = [['1907-09-01', '108', '13.5', '7.9', '20.7'],
['1907-09-02', '108', '16.2', '7.9', '22'],
['1907-09-03', '108', '16.2', '13.1', '21.3'],
['1907-10-04', '108', '16.5', '11.2', '22'],
['1907-10-05', '108', '17.6', '10.9', '25.4'],
['1907-10-06', '108', '13', '11.2', '21.3'],
['1907-11-07', '108', '11.3', '6.3', '16.1'],
['1907-11-08', '108', '8.9', '3.9', '14.9'],
['1907-11-09', '108', '11.6', '3.8', '21.1'],
['1907-11-10', '108', '14.2', '6.4', '24.1'],
['1907-11-11', '108', '15.4', '10.1', '20.4'],
['1907-12-12', '108', '13.9', '11.1', '17.4'],
['1907-12-13', '108', '13.8', '8.3', '21.3'],
['1907-12-14', '108', '13', '6.1', '20.6'],
['1907-12-15', '108', '13.1', '5.7', '20.9']]

for row in test_data:
    month = int(row[0].split('-')[1])
    for i in range(1,13):
        if month == i:
            test_list[i-1].append(row[-1])

The output is:

[[],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 ['20.7', '22', '21.3'],
 ['22', '25.4', '21.3'],
 ['16.1', '14.9', '21.1', '24.1', '20.4'],
 ['17.4', '21.3', '20.6', '20.9']]

The only difference I can see between the two pieces of code is how the data were entered (or read).

It would be really appreciated if anyone could point me to what I did wrong and/or why different outcomes were generated.

Upvotes: 2

Views: 65

Answers (2)

Pubudu Sitinamaluwa

Reputation: 978

I agree with Michael's answer. But why not use pandas directly?

import pandas as pd
    
test_data = [['1907-09-01', '108', '13.5', '7.9', '20.7'],
['1907-09-02', '108', '16.2', '7.9', '22'],
['1907-09-03', '108', '16.2', '13.1', '21.3'],
['1907-10-04', '108', '16.5', '11.2', '22'],
['1907-10-05', '108', '17.6', '10.9', '25.4'],
['1907-10-06', '108', '13', '11.2', '21.3'],
['1907-11-07', '108', '11.3', '6.3', '16.1'],
['1907-11-08', '108', '8.9', '3.9', '14.9'],
['1907-11-09', '108', '11.6', '3.8', '21.1'],
['1907-11-10', '108', '14.2', '6.4', '24.1'],
['1907-11-11', '108', '15.4', '10.1', '20.4'],
['1907-12-12', '108', '13.9', '11.1', '17.4'],
['1907-12-13', '108', '13.8', '8.3', '21.3'],
['1907-12-14', '108', '13', '6.1', '20.6'],
['1907-12-15', '108', '13.1', '5.7', '20.9']]


df = pd.DataFrame(test_data) 

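# You can also load this directly from your CSV. Because the file has a
# header row, skip it and keep integer column labels so df[0] and df[4] work:
# df = pd.read_csv("filename.csv", header=None, skiprows=1)

df[0] = pd.to_datetime(df[0])

# groupby sorts by the month key by default, so no extra sort is needed
grouped_df = df.groupby(df[0].dt.strftime('%m'))[4].apply(list)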

print(grouped_df)


Upvotes: 0

Michael

Reputation: 2414

Let's look at this code block:

all_month = []
month = []

for i in range(1,13):
    all_month.append(month)

This puts the same list into all_month twelve times. You expected a list containing 12 distinct lists, but this code creates 12 references to a single list object. Whenever you use any of those references, you're reading or modifying a list that is shared by all of them, which is what you're seeing. You can verify this by calling id() on each entry of all_month; they all have the same id. Your manual test worked because test_list was written as a literal with twelve separate [] entries, each of which is a distinct list.
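The shared-list behaviour described above can be reproduced in a few lines. This is only a minimal sketch: the names distinct_ids and shared are made up for the illustration, and the single appended value stands in for the real CSV rows.

```python
# Reproduce the aliasing: the same list object is appended twelve times.
month = []
all_month = []
for i in range(1, 13):
    all_month.append(month)

# Every slot refers to one object, so there is only one distinct id.
distinct_ids = {id(inner) for inner in all_month}

# Appending through one slot is visible through all twelve slots.
all_month[8].append('20.7')
shared = all(inner == ['20.7'] for inner in all_month)

print(len(distinct_ids))  # 1
print(shared)             # True
```

Because all twelve entries are the same object, a value appended for September also shows up under every other month.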

There are lots of solutions here. For example, if you use all_month.append([]) inside the loop, each iteration appends a new, distinct list.
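One possible way to apply this fix is sketched below. It builds the twelve lists with a list comprehension (equivalent to calling all_month.append([]) twelve times) and indexes by month directly instead of looping over range(1, 13). The in-memory sample and its header names are assumptions standing in for the real CSV file, which is not reproduced here.

```python
import csv
import io

# Hypothetical stand-in for the CSV file; the header names are made up.
sample = io.StringIO(
    "date,station,avg,low,high\n"
    "1907-09-01,108,13.5,7.9,20.7\n"
    "1907-10-04,108,16.5,11.2,22\n"
    "1907-11-07,108,11.3,6.3,16.1\n"
    "1907-11-08,108,8.9,3.9,14.9\n"
)
data = csv.reader(sample)
header = next(data)  # skip the header row

# Twelve distinct list objects, one per month.
all_month = [[] for _ in range(12)]

for row in data:
    month = int(row[0].split('-')[1])
    # Month 1 maps to index 0, so no inner loop is needed.
    all_month[month - 1].append(row[-1])

print(all_month[8:11])  # [['20.7'], ['22'], ['16.1', '14.9']]
```

With a real file, the io.StringIO object would be replaced by the opened file handle; the grouping loop stays the same.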

Upvotes: 1
