Reputation: 287
I have to extract data date-wise and save a separate CSV for each date. The "time" column of the CSV file holds timestamps in this format: 2018-03-26T16:09:10.024101278Z.
The dataset has more than 100k rows taken at different times. I have tried making a data frame (column names, for reference: name time id ddr version readings):
dataset_CT= pd.read_csv("out_1.csv")
dataset_CT['Dates'] = pd.to_datetime(dataset_CT['time']).dt.date
dataset_CT['Time'] = pd.to_datetime(dataset_CT['time']).dt.time
dataset_CT.sort_values(by='Dates', axis=0, inplace=True)
dataset_CT.set_index(keys=['Dates'], drop=False,inplace=True)
Date_list=dataset_CT['Dates'].unique().tolist()
"got Date_list like this([datetime.date(2018, 3, 26), datetime.date(2018, 3, 31)])"
Date_set = dataset_CT.loc[dataset_CT.Dates=='(2018, 3, 26)']
but I received an empty DataFrame like the one below:
name time id ddr version readings Dates Time
Dates
Upvotes: 1
Views: 1360
Reputation: 148880
Since you read your input with default parameters, I will assume that you have a comma (,) as separator and one header line. IMHO pandas is not required here: it is enough to read the file one row at a time and write it to the csv file corresponding to its date.
The caveats: add the header to each output csv file, and create a new output file for every new date. A collections.defaultdict with a custom default factory is enough to meet those 2 requirements.
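As a minimal illustration of that mechanism (with made-up keys): collections.defaultdict calls its factory, with no arguments, the first time a missing key is looked up, so each new date can lazily get its own value.

```python
import collections

def make_bucket():
    # Called with no arguments on the first access to a missing key;
    # its return value becomes the value stored under that key.
    return []

buckets = collections.defaultdict(make_bucket)
buckets["2018-03-26"].append("row 1")
buckets["2018-03-26"].append("row 2")
buckets["2018-03-31"].append("row 3")
# Each date key got its own list, created on demand.
```

In the answer below, the factory returns a (csv_writer, file_object) pair instead of a list.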
The following code reads an input csv file named "out_1.csv" and writes its content into a bunch of files named like out_2018-03-26.csv, the date being the date of all rows in that output file:
import collections
import csv

with open("out_1.csv") as fdin:
    def get_defaults():
        """Return a pair (csv_writer, file_object) for date dat, initialized with the header."""
        filename = 'out_{}.csv'.format(dat)
        fd = open(filename, "w", newline='')
        fd.write(header)
        return (csv.writer(fd), fd)

    outfiles = collections.defaultdict(get_defaults)
    rd = csv.reader(fdin)
    header = next(fdin)       # store the header to later initialize output files
    for row in rd:
        dat = row[1][:10]     # extract the date from the time column
        wr = outfiles[dat][0]
        wr.writerow(row)      # and write the row to the appropriate output file

# close the output files
for d in outfiles:
    outfiles[d][1].close()
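The row[1][:10] trick works because the time column is ISO-8601: the first 10 characters of each timestamp are always the YYYY-MM-DD date. For example, with the sample value from the question:

```python
timestamp = "2018-03-26T16:09:10.024101278Z"  # sample value from the question
dat = timestamp[:10]                          # keep only the YYYY-MM-DD prefix
filename = "out_{}.csv".format(dat)           # -> out_2018-03-26.csv
```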
After thinking about it a second time: the above code could keep too many files open. Here is an improved version that only keeps the files for the 3 most recently encountered dates open (tested):
import collections
import csv

with open("out_1.csv") as fdin:
    cache = collections.deque()
    seen = set()

    def get_defaults():
        """Return a pair (csv_writer, file_object) for date dat, initialized with the header."""
        filename = 'out_{}.csv'.format(dat)
        fd = open(filename, 'a' if dat in seen else 'w', newline='')
        if 0 == fd.tell():    # file is currently empty: write the header
            fd.write(header)
        ret = (csv.writer(fd), fd)
        cache.append(dat)
        seen.add(dat)
        if len(cache) > 3:    # only keep 3 open files
            old = cache.popleft()
            print("Closing", old)
            outfiles[old][1].close()
            del outfiles[old]
        return ret

    outfiles = collections.defaultdict(get_defaults)
    rd = csv.reader(fdin)
    header = next(fdin)       # store the header to later initialize output files
    for row in rd:
        dat = row[1][:10]     # extract the date from the time column
        wr = outfiles[dat][0]
        wr.writerow(row)      # and write the row to the appropriate output file

# close the currently opened output files
for d in outfiles:
    outfiles[d][1].close()
Upvotes: 1
Reputation: 862511
Does comparing by string work?
Date_set = dataset_CT.loc[dataset_CT.Dates=='2018-03-26']
If not, try changing Series.dt.date:
dataset_CT['Dates'] = pd.to_datetime(dataset_CT['time']).dt.date
Date_set = dataset_CT.loc[dataset_CT.Dates=='2018-03-26']
to Series.dt.floor, for datetimes with no times:
dataset_CT['Dates'] = pd.to_datetime(dataset_CT['time']).dt.floor('d')
Date_set = dataset_CT.loc[dataset_CT.Dates=='2018-03-26']
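A self-contained sketch of the difference, using a hypothetical two-row frame with naive timestamps: .dt.date produces Python date objects, which never compare equal to a string, while .dt.floor('d') keeps the datetime64 dtype, so pandas parses the string into a Timestamp before comparing.

```python
import pandas as pd

# Hypothetical data mimicking the question's "time" column (no timezone).
df = pd.DataFrame({"time": ["2018-03-26 16:09:10", "2018-03-31 08:00:00"]})

as_date = pd.to_datetime(df["time"]).dt.date         # object dtype, datetime.date values
as_floor = pd.to_datetime(df["time"]).dt.floor("d")  # datetime64 dtype, times set to 00:00

mask_date = as_date == "2018-03-26"    # date objects never equal a string -> all False
mask_floor = as_floor == "2018-03-26"  # string parsed as Timestamp -> matches first row
```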
Upvotes: 1