Reputation: 63
I have a script that drops duplicate timings based on HH:MM and sums them, but the output is weird:
1. The output DataFrame seems to have 'Name: Time, dtype: object' below it.
2. There is no column name for my summed time.
3. I can't seem to add a new column properly.
Please tell me what is happening!
dataset:
Time Filename
12:00:00 file1.txt
12:01:00 file1.txt
12:02:11 file1.txt
12:01:50 file2.txt
12:02:11 file2.txt
12:01:50 file3.txt
12:02:11 file3.txt
12:03:50 file3.txt
output:
Filename
file1.txt 3
file2.txt 2
file3.txt 3
Name: Time, dtype: object
so far this is my code
import os  # needed for os.path.split below
import pandas as pd
from pathlib import Path
folder = Path(r"C:\\Users\\CL\\Desktop\\Script") # path to your folder
files = list(folder.rglob("*.txt")) #get all the txt file
df = pd.DataFrame()
pattern = r"(\d{2}):(\d{2}):(\d{2})" #pattern to read
for file in files:
    _df = pd.read_csv(file, dtype=str, names=['Time'])
    _df['Filename'] = os.path.split(file)[-1]
    df = df.append(_df)
filtered_df = df[df.Time.str.match(pattern)]
#sum non-duplicate timing
new_df = filtered_df.groupby("Filename")["Time"].apply(lambda d: d.str[:5].nunique())
#WEIRD output when i testing to add new column
new_df['Filename'] = filtered_df['Filename']
print(new_df)
'weird' output after adding new col:
Filename
1.txt 3
2.txt 2
3.txt 3
Filename 2 2.txt
5 2.txt
2 11...
Name: Duration, dtype: object
Update (edit 12/9/21):
import os  # needed for os.path.split below
import pandas as pd
from pathlib import Path
folder = Path(r"C:\\Users\\CL\\Desktop\\Script") # path to your folder
files = list(folder.rglob("*.txt")) #get all the txt file
df = pd.DataFrame()
pattern = r"(\d{2}):(\d{2}):(\d{2})" #pattern to read
for file in files:
    _df = pd.read_csv(file, dtype=str, names=['Time'])
    _df['Filename'] = os.path.split(file)[-1]
    _df['Path'] = "_".join(file.parts[-5:]) #new code - to get file path
    df = df.append(_df)
filtered_df = df[df.Time.str.match(pattern)]
new_df = filtered_df.groupby(["Filename","Path"])["Duration"].apply(lambda d: d.str[:5].nunique()).reset_index()
df.to_csv('new_df.csv', index=False) #saved in python folder
output on jupyter
Filename Duration Path
0 1.txt 3 2021_01_01_cat_1.txt
1 2.txt 2 2021_01_01_cat_2.txt
2 3.txt 3 2021_01_01_cat_3.txt
output on csv
#shows unfiltered data
Upvotes: 1
Views: 653
Reputation: 23217
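The problem is that the result is NOT a pandas DataFrame but a pandas Series. Selecting the single column "Time" after the groupby and applying a function that returns one number per group gives a Series indexed by Filename, and 'Name: Time, dtype: object' is simply the footer pandas prints for a Series.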
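As a quick check, here is a minimal sketch that rebuilds the question's sample data by hand (an assumption, since the real script reads it from text files) and inspects what the groupby returns. It also shows why adding a "column" misbehaves: on a Series a string key is a row label. The value "placeholder" is purely illustrative.

```python
import pandas as pd

# Sample rows copied from the question's dataset (built by hand here).
filtered_df = pd.DataFrame({
    "Time": ["12:00:00", "12:01:00", "12:02:11", "12:01:50",
             "12:02:11", "12:01:50", "12:02:11", "12:03:50"],
    "Filename": ["file1.txt"] * 3 + ["file2.txt"] * 2 + ["file3.txt"] * 3,
})

# Same expression as in the question.
counts = filtered_df.groupby("Filename")["Time"].apply(lambda d: d.str[:5].nunique())

print(type(counts).__name__)       # Series
print(counts.name)                 # Time
print(hasattr(counts, "columns"))  # False

# On a Series, a string key is a row label, so this appends one more
# entry labelled "Filename" instead of creating a column.
# "placeholder" is an illustrative value, not from the question.
counts["Filename"] = "placeholder"
print(counts.index.tolist())
```

The last line lists "Filename" as a fourth index label next to the three file names, which matches the extra "Filename" row in the question's output.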
You can get your result as a DataFrame by either adding as_index=False to your .groupby statement, or calling reset_index() after it:
new_df = filtered_df.groupby("Filename", as_index=False)["Time"].apply(lambda d: d.str[:5].nunique())
or
new_df = filtered_df.groupby("Filename")["Time"].apply(lambda d: d.str[:5].nunique()).reset_index()
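For illustration, a small sketch of the reset_index() variant on the question's sample data, again rebuilt by hand as an assumption. The column names "UniqueMinutes" and "Source" are hypothetical, chosen only to show that the count now has a header and that a new column can be added normally.

```python
import pandas as pd

# Sample rows copied from the question's dataset (built by hand here).
filtered_df = pd.DataFrame({
    "Time": ["12:00:00", "12:01:00", "12:02:11", "12:01:50",
             "12:02:11", "12:01:50", "12:02:11", "12:03:50"],
    "Filename": ["file1.txt"] * 3 + ["file2.txt"] * 2 + ["file3.txt"] * 3,
})

# Count distinct HH:MM values per file, then turn the Series into a DataFrame.
# name= sets the header of the count column ("UniqueMinutes" is hypothetical).
new_df = (
    filtered_df.groupby("Filename")["Time"]
    .apply(lambda d: d.str[:5].nunique())
    .reset_index(name="UniqueMinutes")
)

# Adding a column now behaves as expected ("Source" is hypothetical).
new_df["Source"] = "Script"

print(new_df)
```

One more thing that may be related: the updated code in the question calls df.to_csv(...), which writes the unfiltered df rather than new_df. If the grouped result is what should end up in the file, new_df.to_csv('new_df.csv', index=False) is probably what was intended.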
Upvotes: 2