WJA

Reputation: 7004

Dealing with lots of data in Firebase for a recommender system

I am building a recommender system where I use Firebase to store and retrieve data about movies and user preferences.

Each movie can have several attributes, and the data looks as follows:

{ 
    "titanic": 
    {"1997": 1, "english": 1, "dicaprio": 1,    "romance": 1, "drama": 1 }, 
    "inception": 
    { "2010": 1, "english": 1, "dicaprio": 1, "adventure": 1, "scifi": 1}
...
}

To make the recommendations, my algorithm requires all the data (movies) as input, which is then matched against a user profile.

However, in production I need to retrieve more than 10,000 movies. The algorithm can handle this relatively quickly, but loading the data from Firebase takes a long time.

I retrieve the data as follows:

firebase.database().ref(moviesRef).on('value', function(snapshot) {
    // snapshot.val();
}, function(error){
    console.log(error)
});

I am therefore wondering if you have any thoughts on how to speed things up. Are there any plugins or techniques known to solve this?

I am aware that denormalization could help split the data up, but the problem is really that I need ALL movies and ALL the corresponding attributes.

Upvotes: 18

Views: 5721

Answers (3)

Uddhav P. Gautam

Reputation: 7626

Firebase NoSQL JSON structure best practice is to "avoid nesting data", but you said you don't want to change your data. In that case, you can make a REST call to any particular node (the node of each movie) of the Firebase database.

Solution 1) You can create a fixed number of worker threads with a thread pool (via Executors). Each worker thread makes an HTTP (REST) request as below. Based on your device's performance and memory, you can decide how many worker threads the pool should have. A code snippet might look like this:

/* a fixed pool of 10 worker threads; previously constructed threads
   are reused when they become available */
ExecutorService threadPoolExecutor = Executors.newFixedThreadPool(10);
OkHttpClient client = new OkHttpClient(); /* shared OkHttp HTTP client */

for (int i = 0; i < 100; i++) { /* load the first 100 movies */
    final int index = i; /* effectively-final copy for the lambda */
    threadPoolExecutor.execute(() -> {
        /* OkHttp request; urlStr can be something like
           "https://earthquakesenotifications.firebaseio.com/movies" */
        /* Note: Firebase stores an index for every array, so if your
           movies are stored as a JSON array, the first worker thread
           can read movie 0, the second movie 1, and so on. */
        Request request = new Request.Builder()
                .url(urlStr + "/" + index + ".json")
                .build();
        try (Response response = client.newCall(request).execute()) {
            String str = response.body().string();
            /* hand `str` off to your recommender here */
        } catch (IOException e) {
            e.printStackTrace();
        }
    });
}
threadPoolExecutor.shutdown();
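The same fan-out idea also translates to the question's JavaScript stack via the Firebase REST API. Below is a minimal sketch, assuming each movie is readable at `<databaseURL>/movies/<key>.json`; `baseUrl`, `movieKeys`, and the batch size are illustrative placeholders, not anything Firebase mandates:

```javascript
// Fan-out over the Firebase REST API: fetch movie nodes in parallel,
// a fixed number of requests at a time.
function buildMovieUrl(baseUrl, key) {
  return baseUrl + '/movies/' + encodeURIComponent(key) + '.json';
}

// Split the keys into batches of `size`, so at most `size` requests
// are in flight at once.
function toBatches(keys, size) {
  var batches = [];
  for (var i = 0; i < keys.length; i += size) {
    batches.push(keys.slice(i, i + size));
  }
  return batches;
}

// Fetch every movie, batch by batch, and return one combined object.
async function fetchAllMovies(baseUrl, movieKeys, size) {
  var movies = {};
  for (var batch of toBatches(movieKeys, size)) {
    var results = await Promise.all(batch.map(function (key) {
      return fetch(buildMovieUrl(baseUrl, key)).then(function (res) {
        return res.json();
      });
    }));
    batch.forEach(function (key, i) { movies[key] = results[i]; });
  }
  return movies;
}
```

Whether this beats one bulk download depends on per-request overhead, so it is worth measuring against a single read of the whole `movies` node.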

Solution 2) Solution 1 is pull-based, not based on the listener/observer pattern. Firebase also has push technology: whenever a particular node changes in the Firebase NoSQL JSON, every client with a listener attached to that node receives the new data via onDataChange(DataSnapshot dataSnapshot). For this you can read the movies node and iterate over its children, like below:

DatabaseReference moviesRef = FirebaseDatabase.getInstance()
        .getReference().child("movies");

/* reads the movies node once; getChildren() is available on the
   DataSnapshot delivered to the listener, not on the reference */
moviesRef.addListenerForSingleValueEvent(new ValueEventListener() {
    @Override
    public void onDataChange(DataSnapshot dataSnapshot) {
        List<DataSnapshot> movies = new ArrayList<>();
        for (DataSnapshot movie : dataSnapshot.getChildren()) {
            movies.add(movie); /* collect every movie first... */
        }
        /* ...then update your ListView/RecyclerView once everything
           is loaded; updating the list item by item is still slow */
    }

    @Override
    public void onCancelled(DatabaseError databaseError) {
        /* handle the read error */
    }
});

Upvotes: 3

johnozbay

Reputation: 2222

My suggestion would be to use Cloud Functions to handle this.

Solution 1 (Ideally)

If you can calculate suggestions every hour / day / week

You can use a Cloud Functions Cron to fire up daily / weekly and calculate recommendations per users every week / day. This way you can achieve a result more or less similar to what Spotify does with their weekly playlists / recommendations.

The main advantage of this is that your users wouldn't have to wait for all 10,000 movies to be downloaded: the heavy lifting happens in a cloud function, say every Sunday night, which compiles a list of 25 recommendations and saves it into each user's data node, ready to download when the user accesses their profile.

Your cloud functions code would look like this :

var movies, allUsers; 

exports.weekly_job = functions.pubsub.topic('weekly-tick').onPublish((event) => {
  getMoviesAndUsers();
});  

function getMoviesAndUsers () {
  firebase.database().ref(moviesRef).on('value', function(snapshot) {
    movies = snapshot.val();
    firebase.database().ref(allUsersRef).on('value', function(snapshot) {
        allUsers = snapshot.val();
        createRecommendations();
    });
});
}

function createRecommendations () {
  // do something magical with movies and allUsers here.

  // then write the recommendations to each user's profiles kind of like 
  userRef.update({"userRecommendations" : {"reco1" : "Her", "reco2" : "Black Mirror"}});
  // etc. 
}

Forgive the pseudo-code. I hope this gives an idea though.

Then on your frontend you would have to get only the userRecommendations for each user. This way you can shift the bandwidth & computing from the users device to a cloud function. And in terms of efficiency, without knowing how you calculate recommendations, I can't make any suggestions.

Solution 2

If you can't calculate suggestions every hour / day / week, and you have to do it each time user accesses their recommendations panel

Then you can trigger a cloud function every time the user visits their recommendations page. A quick cheat solution I use for this is to write a value into the user's profile like : {getRecommendations:true}, once on pageload, and then in cloud functions listen for changes in getRecommendations. As long as you have a structure like this :

userID > getRecommendations : true

And if you have proper security rules so that each user can only write to their own path, this method also gives you the correct userID making the request, so you know which user to calculate recommendations for. A cloud function can most likely pull 10,000 records faster, save the user bandwidth, and finally write only the recommendations to the user's profile (similar to Solution 1 above). Your setup would look like this:

[Frontend Code]

//on pageload
userProfileRef.update({"getRecommendations" : true});
userRecommendationsRef.on('value', function(snapshot) {  gotUserRecos(snapshot.val());  });

[Cloud Functions (Backend Code)]

exports.userRequestedRecommendations = functions.database.ref('/users/{uid}/getRecommendations').onWrite(event => {
  const uid = event.params.uid;
  firebase.database().ref(moviesRef).on('value', function(snapshot) {
    movies = snapshot.val();
    firebase.database().ref(userRefFromUID).on('value', function(snapshot) {
        usersMovieTasteInformation = snapshot.val();
        // do something magical with movies and user's preferences here.
        // then 
        return userRecommendationsRef.update({"reco1" : "Her", "reco2" : "Black Mirror"});
    });
  });
});

Since your frontend will be listening for changes at userRecommendationsRef, as soon as your cloud function is done, your user will see the results. This might take a few seconds, so consider using a loading indicator.

P.S 1: I ended up using more pseudo-code than originally intended, and removed error handling etc. hoping that this generally gets the point across. If there's anything unclear, comment and I'll be happy to clarify.

P.S. 2: I'm using a very similar flow for a mini-internal-service I built for one of my clients, and it's been happily operating for longer than a month now.

Upvotes: 22

Heyji

Reputation: 1213

Although you stated your algorithm needs all the movies and all their attributes, that does not mean it processes them all at once. Any computation unit has its limits, and your algorithm most likely already chunks the data into smaller parts it can handle.

Having said that, if you want to speed things up, you can modify your algorithm to parallelize fetching and processing of the data/movies:

| fetch  | -> |process | -> | fetch  | ...
|chunk(1)|    |chunk(1)|    |chunk(3)|

(in parallel) | fetch  | -> |process | ...
              |chunk(2)|    |chunk(2)|

With this approach, you can hide almost the whole processing time (all but the last chunk), provided processing really is faster than fetching (though you haven't said how "relatively fast" your algorithm runs compared to fetching all the movies).
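In the question's JavaScript stack, that overlap can be sketched as below. `fetchChunk` and `processChunk` are placeholders I'm assuming for the real pieces: a paginated Firebase query (e.g. `orderByKey().limitToFirst(...)`) and the recommender step, both returning Promises:

```javascript
// Pipeline: while chunk i is being processed, chunk i+1 is already
// being fetched, so fetching and processing overlap instead of
// running strictly one after the other.
async function pipeline(fetchChunk, processChunk, numChunks) {
  var pending = fetchChunk(0);       // start fetching the first chunk
  for (var i = 0; i < numChunks; i++) {
    var chunk = await pending;       // wait for chunk i to arrive
    if (i + 1 < numChunks) {
      pending = fetchChunk(i + 1);   // kick off the next fetch...
    }
    await processChunk(chunk);       // ...while this chunk is processed
  }
}
```

With this shape, each fetch after the first runs concurrently with the processing of the previous chunk, which is exactly the two-lane diagram above.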

This "high level" approach to your problem is probably your best chance if fetching the movies really is slow, although it requires more work than simply flipping a hypothetical "speed up" switch in a library. It is, however, a sound approach when dealing with large chunks of data.

Upvotes: 2
