Hannan

Reputation: 1191

Django: upload csv file and keep it in memory

I'm trying to build a web app with Django where the user uploads a csv file, possibly a big one. The code then cleans the file of bad data, and the user can run queries against the clean data.

As it stands, I believe the whole code runs again every time the user makes a query, which means the cleaning starts over each time.

Question:

Is there any way that once the csv data is clean, it stays in memory and the user can make queries against that clean data?

import numpy as np
import pandas as pd
from django.http import JsonResponse

def converter(num):
    try:
        return float(num)
    except ValueError:
        try:
            num = num.replace("-", '0.0').replace(',', '')
            return float(num)
        except ValueError:
            return np.nan

def get_clean_data(request):
    # Read the data from csv file:
    df = pd.read_csv("data.csv")

    # Clean the data and send JSON response
    df['month'] = df['month'].str.split("-", expand=True)[1]
    df[df.columns[8:]] = df[df.columns[8:]].astype(str).applymap(converter)
    selected_year = df[df["Departure Datetime: Year (YYYY)"] == 2015]
    data_for_user = (selected_year.groupby(
        by="route").sum().sort_values(by="revenue").to_json())

    return JsonResponse(data_for_user, safe=False)
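As an aside, the converter above can be sanity-checked on its own; the sample inputs below are illustrative, not from the real csv:

```python
import numpy as np

def converter(num):
    # Coerce a value to float; treat "-" as 0.0 and strip thousands separators.
    try:
        return float(num)
    except ValueError:
        try:
            num = num.replace("-", '0.0').replace(',', '')
            return float(num)
        except ValueError:
            return np.nan

print(converter("1,234.5"))  # 1234.5  (comma stripped)
print(converter("-"))        # 0.0     ("-" treated as zero)
print(converter("n/a"))      # nan     (unparseable -> NaN)
```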

Upvotes: 3

Views: 998

Answers (1)

Will Keeling

Reputation: 22994

One way to achieve this could be to cache the dataframe in memory after it has been cleaned. Subsequent requests could then use the cleaned version from the cache.

from django.core.cache import cache

def get_clean_data(request):
    # Check the cache for cleaned data
    df = cache.get('cleaned_data')

    if df is None:
        # Read the data from csv file:
        df = pd.read_csv("data.csv")

        # Clean the data
        df['month'] = df['month'].str.split("-", expand=True)[1]
        df[df.columns[8:]] = df[df.columns[8:]].astype(str).applymap(converter)

        # Put it in the cache
        cache.set('cleaned_data', df, timeout=600)

    selected_year = df[df["Departure Datetime: Year (YYYY)"] == 2015]
    data_for_user = (selected_year.groupby(
        by="route").sum().sort_values(by="revenue").to_json())

    return JsonResponse(data_for_user, safe=False)

You'd need to be a little bit careful, because if the csv file is very large it may consume a large amount of memory when cached.

Django supports a number of different cache backends, from simple local-memory caching to more complex backends such as Memcached.
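For reference, the backend is selected via the `CACHES` setting; a minimal sketch (backend choice and location are assumptions for your setup):

```python
# settings.py

# Simple per-process, in-memory cache (Django's default backend).
# Note: each server process gets its own cache, so the dataframe
# would be cleaned once per process.
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.locmem.LocMemCache",
    }
}

# To share one cache across processes/servers, Memcached instead
# (PyMemcacheCache requires Django 3.2+ and the pymemcache package):
# CACHES = {
#     "default": {
#         "BACKEND": "django.core.cache.backends.memcached.PyMemcacheCache",
#         "LOCATION": "127.0.0.1:11211",
#     }
# }
```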

Upvotes: 2
