Extract data Date-wise using Python Code & Save as separate csv Datewise

I have to extract data datewise and save as separate csv for each different dates: "time" is given in this format(2018-03-26T16:09:10.024101278Z) in one column of CSV file.

This Dataset has more than 100k rows taken in a different time. "I have tried making a data frame" '''Column name: (name time id ddr version readings) for reference'''

dataset_CT= pd.read_csv("out_1.csv")
dataset_CT['Dates'] = pd.to_datetime(dataset_CT['time']).dt.date
dataset_CT['Time'] = pd.to_datetime(dataset_CT['time']).dt.time
dataset_CT.sort_values(by='Dates', axis=0, inplace=True)
dataset_CT.set_index(keys=['Dates'], drop=False,inplace=True)

Date_list=dataset_CT['Dates'].unique().tolist()

"got Date_list like this([datetime.date(2018, 3, 26), datetime.date(2018, 3, 31)])"

Date_set = dataset_CT.loc[dataset_CT.Dates=='(2018, 3, 26)']

I received empty Dataframe like below

      name  time id ddr version readings Dates  Time
Dates

Upvotes: 1

Views: 1360

Answers (2)

Serge Ballesta
Serge Ballesta

Reputation: 148880

As you read your input with default parameters, I will assume that you have comma (,) for separator and one header line. IMHO pandas for that is not required here. It is enough to read the file one row at time and write it in a csv file corresponding to the date.

The caveats: add the header to each output csv file and create a new output file for every new date. A collections.defaultdict with a custom default function is enough to meet those 2 requirement.

The following code reads an input csv file named "out_1.csv" and writes it content in a bunch of files named out_2018-03-26.csv the date being the date of all rows in the output file:

with open("out_1.csv") as fdin:
    def get_defaults():
        """returns a pair (csv_writer, file_object) for date dat initialized with header"""
        filename = 'out{}.csv'.format(dat)
        fd = open(filename, "w", newline='')
        fd.write(header)
        return (csv.writer(fd), fd)
    outfiles = collections.defaultdict(get_defaults)
    rd = csv.reader(fdin)
    header = next(fdin)             # store the header to later initialize output files
    for row in rd:
        dat = row[1][:10]           # extract the date
        wr = outfiles[dat][0]
        wr.writerow(row)            # and write the row to the appropriate output file
    # close the output files
    for i in outfiles:
        outfile[i][1].close()

After a second thinking about it, above code could keep too many open files. Here is an improved version that only keep open files for the 3 most recently encountered dates (tested):

with open("out_1.csv") as fdin:
    cache = collections.deque()
    seen = set()
    def get_defaults():
        """returns a pair (csv_writer, file_object) for date dat initialized with header"""
        filename = 'out{}.csv'.format(dat)
        fd = open(filename, 'a' if dat in seen else 'w', newline='')
        if 0 == fd.tell():          # file is currently empty: write header
            fd.write(header)
        ret = (csv.writer(fd), fd)
        cache.append(dat)
        seen.add(dat)
        if len(cache) > 3:          # only keep 3 open files
            old = cache.popleft()
            print("Closing", old)
            outfiles[old][1].close()
            del outfiles[old]
        return ret

    outfiles = collections.defaultdict(get_defaults)
    rd = csv.reader(fdin)
    header = next(fdin)   # store the header to later initialize output files
    for row in rd:
        dat = row[1][:10] # extract the date
        wr = outfiles[dat][0]
        wr.writerow(row)  # and write the row to the appropriate output file
    # close the currently opened output files
    for i in outfiles:
        outfiles[i][1].close()

Upvotes: 1

jezrael
jezrael

Reputation: 862511

How working compare by string?

Date_set = dataset_CT.loc[dataset_CT.Dates=='2018-03-26']

If not working, try to changed Series.dt.date:

dataset_CT['Dates'] = pd.to_datetime(dataset_CT['time']).dt.date
Date_set = dataset_CT.loc[dataset_CT.Dates=='2018-03-26']

to Series.dt.floor for datetimes with no times:

dataset_CT['Dates'] = pd.to_datetime(dataset_CT['time']).dt.floor('d')
Date_set = dataset_CT.loc[dataset_CT.Dates=='2018-03-26']

Upvotes: 1

Related Questions