panzers

Reputation: 63

pandas - 'weird' dataframe format

I have a script that drops duplicate timings based on HH:MM and counts the unique ones per file, but the result looks weird:
1. The output dataframe seems to have 'Name: Time, dtype: object' below it.
2. There is no column name for my summed time.
3. I can't seem to add a new column properly. Please tell me what is happening!

dataset:

Time      Filename
12:00:00  file1.txt
12:01:00  file1.txt
12:02:11  file1.txt
12:01:50  file2.txt
12:02:11  file2.txt
12:01:50  file3.txt
12:02:11  file3.txt
12:03:50  file3.txt


output:
Filename
file1.txt     3
file2.txt     2
file3.txt     3
Name: Time, dtype: object

This is my code so far:

import os
import pandas as pd
from pathlib import Path

folder = Path(r"C:\\Users\\CL\\Desktop\\Script") # path to your folder
files = list(folder.rglob("*.txt")) #get all the txt file
df = pd.DataFrame()
pattern = r"(\d{2}):(\d{2}):(\d{2})" #pattern to read

for file in files:
    _df = pd.read_csv(file, dtype=str, names=['Time'])
    _df['Filename'] = os.path.split(file)[-1]
    df = pd.concat([df, _df])  # DataFrame.append was removed in pandas 2.0
    filtered_df = df[df.Time.str.match(pattern)]

#sum non-duplicate timing   
new_df = filtered_df.groupby("Filename")["Time"].apply(lambda d: d.str[:5].nunique())

# weird output when I try to add a new column
new_df['Filename'] = filtered_df['Filename']

print(new_df)

'Weird' output after adding the new column:

Filename
1.txt                                                    3
2.txt                                                    2
3.txt                                                    3
Filename        2      2.txt
5      2.txt
2     11...
Name: Duration, dtype: object

Update (12/9/21):

import os
import pandas as pd
from pathlib import Path

folder = Path(r"C:\\Users\\CL\\Desktop\\Script") # path to your folder
files = list(folder.rglob("*.txt")) #get all the txt file
df = pd.DataFrame()
pattern = r"(\d{2}):(\d{2}):(\d{2})" #pattern to read

for file in files:
    _df = pd.read_csv(file, dtype=str, names=['Time'])
    _df['Filename'] = os.path.split(file)[-1]
    _df['Path'] = "_".join(file.parts[-5:]) #new code - to get file path
    df = pd.concat([df, _df])  # DataFrame.append was removed in pandas 2.0
    filtered_df = df[df.Time.str.match(pattern)]

new_df = filtered_df.groupby(["Filename","Path"])["Duration"].apply(lambda d: d.str[:5].nunique()).reset_index()



df.to_csv('new_df.csv', index=False) #saved in python folder

Output in Jupyter:

  Filename  Duration                  Path
0    1.txt         3  2021_01_01_cat_1.txt
1    2.txt         2  2021_01_01_cat_2.txt
2    3.txt         3  2021_01_01_cat_3.txt

Output in the CSV:

It shows the unfiltered data.

Upvotes: 1

Views: 653

Answers (1)

SeaBean

Reputation: 23217

The problem is that the result of your groupby is NOT a pandas DataFrame but a pandas Series. That is why 'Name: Time, dtype: object' is printed underneath it and why the counts have no column name.
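
To see why adding a "column" misbehaves on a Series, here is a minimal sketch. It assumes a hand-built Series called `counts` that stands in for the groupby result; the name `counts` and the placeholder value are illustrative and not taken from the question.

```python
import pandas as pd

# Hypothetical stand-in for the groupby result: a Series indexed by Filename.
counts = pd.Series(
    [3, 2, 3],
    index=pd.Index(["file1.txt", "file2.txt", "file3.txt"], name="Filename"),
    name="Time",
    dtype=object,
)

print(type(counts).__name__)  # a Series has one set of values and an index, no columns

# On a Series, [] assignment with an unknown label appends a new ROW
# labelled "Filename"; it does not create a column.
counts["Filename"] = "placeholder"

print(counts.index.tolist())
```

The extra entry labelled `Filename` at the bottom of the question's output comes from the same mechanism: the assigned value was stored as one more element of the Series.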

You can get your result as a DataFrame by either:

  1. adding as_index=False to your .groupby call, or
  2. calling reset_index() after it:

new_df = filtered_df.groupby("Filename", as_index=False)["Time"].apply(lambda d: d.str[:5].nunique())

or

new_df = filtered_df.groupby("Filename")["Time"].apply(lambda d: d.str[:5].nunique()).reset_index()
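
As a rough end-to-end check, the sketch below rebuilds the sample dataset from the question in memory and applies the reset_index() variant. The column name `UniqueMinutes` is an assumption chosen here for readability; omit the `name=` argument to keep the original `Time` label.

```python
import pandas as pd

# Sample data copied from the question, built in memory instead of read from files.
filtered_df = pd.DataFrame({
    "Time": ["12:00:00", "12:01:00", "12:02:11", "12:01:50",
             "12:02:11", "12:01:50", "12:02:11", "12:03:50"],
    "Filename": ["file1.txt"] * 3 + ["file2.txt"] * 2 + ["file3.txt"] * 3,
})

# Count distinct HH:MM values per file; this intermediate result is a Series.
per_file = filtered_df.groupby("Filename")["Time"].apply(
    lambda d: d.str[:5].nunique()
)

# Turn the index into a regular column and give the counts a column name.
new_df = per_file.reset_index(name="UniqueMinutes")

print(new_df)
```

Once `new_df` is a real DataFrame, `new_df["SomeColumn"] = ...` adds a column as expected. Note also that the updated script in the question calls `df.to_csv(...)`, which writes the unfiltered frame; saving the grouped result means calling `to_csv` on `new_df` instead.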

Upvotes: 2
