Reputation: 134
I've built a scraper that gets product data from different shopping websites.
When I run python scraper.py, the program prints a Python dict (pretty-printed, so single quotes rather than strict JSON) containing all the data, like this:
{'ebay': [{'advertiser': 'ebay',
           'advertiser_url': 'https://rover.ebay.com/rover/1/711-53200-19255-0/1?ff3=2&toolid=10041&campid=5338482617&customid=&lgeo=1&vectorid=229466&item=302847614914',
           'description': '30-Day Warranty - Free Charger & Cable - Easy Returns!',
           'main_image': 'https://thumbs1.ebaystatic.com/pict/04040_0.jpg',
           'price': '290.0',
           'title': 'Apple iPhone 8 Plus Smartphone AT&T Sprint T-Mobile Verizon or Unlocked 4G LTE'}]}
I want this data to be added to the database automatically every time I run the scraper.
Here's my database structure:
models.py
from django.db import models

class Product(models.Model):
    similarity_id = models.CharField(max_length=255, blank=True, null=True)
    name = models.CharField(max_length=255, blank=True, null=True)
    url = models.SlugField(blank=True, unique=True, allow_unicode=True)
    advertiser_url = models.TextField(blank=True, null=True)
    main_image = models.TextField(blank=True, null=True)
    second_image = models.TextField(blank=True, null=True)
    third_image = models.TextField(blank=True, null=True)
    old_price = models.FloatField(default=0.00)
    price = models.FloatField(default=0.00)
    discount = models.FloatField(default=0.00)
    currency = models.CharField(max_length=255, default="$")
    description = models.TextField(blank=True, null=True)
    keywords = models.CharField(max_length=255, blank=True, null=True)
    asin = models.CharField(max_length=80, blank=True, null=True)
    iban = models.CharField(max_length=255, blank=True, null=True)
    sku = models.CharField(max_length=255, blank=True, null=True)
    seller = models.CharField(max_length=255, blank=True, null=True)
    free_shipping = models.BooleanField(default=False)
    in_stock = models.BooleanField(default=True)
    sold_items = models.IntegerField(default=0)
    likes_count = models.IntegerField(default=0)
    category = models.CharField(max_length=255, blank=True, null=True)
    sub_category = models.CharField(max_length=255, blank=True, null=True)
    reviews_count = models.IntegerField(default=0)
    rating = models.FloatField(default=0)
    active = models.BooleanField(default=True)
    is_prime = models.BooleanField(default=False)
    created_on = models.DateTimeField(auto_now_add=True)
    advertiser = models.CharField(max_length=255, blank=True, null=True)

    objects = ProductManager()  # custom manager, defined elsewhere

    class Meta:
        verbose_name_plural = "products"

    def __str__(self):
        return self.name
Upvotes: 0
Views: 8270
Reputation: 7251
I work with JSON a lot: I cache API responses, so I receive a lot of JSON-based API data and want to store it in a database for querying and caching. If you use Postgres (for instance), you will see that it has extensions for JSON. This means you can save JSON data in a dedicated JSON field. Better still, there are SQL extensions that let you run queries against the JSON data itself; that is, Postgres has "NoSQL" capabilities. This lets you work with JSON natively. I find it very compelling and recommend it highly. There is a learning curve because it uses non-traditional SQL, but heck, we have Stack Overflow.
see: https://django-postgres-extensions.readthedocs.io/en/latest/json.html
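For a concrete picture, here is a minimal sketch of such a model, assuming Django 3.1+ where JSONField lives in django.db.models (older versions use django.contrib.postgres.fields.JSONField); the ScrapeCache name is hypothetical, and my DearCache example below follows the same pattern:

from django.db import models

class ScrapeCache(models.Model):
    # Hypothetical model for caching scraped payloads as-is.
    object_type = models.CharField(max_length=64)
    last_modified = models.DateTimeField(auto_now=True)
    jdata = models.JSONField(default=dict)  # stored as jsonb in Postgres

# nested keys are queryable straight from the ORM, e.g.:
# ScrapeCache.objects.filter(object_type='products', jdata__advertiser='ebay')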
here is a little example:
product_onhand_rows = DearCache.objects.filter(
    object_type=DearObjectType.PRODUCT_AVAILABILITY.value
).filter(
    dear_account_id=self.dear_api.account_id
).filter(
    jdata__Location=warehouse
).filter(
    jdata__SKU=sku
).all()
in this example, I have the JSON stored in a field called jdata. jdata__Location accesses the key Location in the JSON, and the lookups nest deeper the same way. For advanced queries, I resort to SQL:
select object_type, last_modified, jdata
from cached_dear_dearcache
where object_type = 'orders'
  and jdata->>'Status' in ('ESTIMATING', 'ESTIMATED')
order by last_modified;
and there's more: you can 'unroll' lists. (This is what I would call a complicated example: my JSON has a list of invoices, each of which has a list of lines.)
/* 1. listing invoice lines. We have to iterate over the array of invoices to get
   each invoice, and then inside the invoice object find the array of lines */
select object_type,
       last_modified,
       jsonb_array_elements(jsonb_array_elements(cached_dear_dearcache.jdata#>'{Invoices}')->'Lines') as lines,
       jsonb_array_elements(cached_dear_dearcache.jdata#>'{Invoices}')->'InvoiceDate' as invoice_date,
       jsonb_array_elements(cached_dear_dearcache.jdata#>'{Invoices}')->'InvoiceNumber' as invoice_number
from cached_dear_dearcache
where object_type = 'orders'
order by last_modified;
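if you want those raw queries inside Django rather than psql, a connection cursor works; a sketch (the helper name is just for illustration):

from django.db import connection

def estimating_orders():
    # Hypothetical helper: runs the Status query above through
    # Django's database connection instead of psql.
    with connection.cursor() as cur:
        cur.execute(
            "select object_type, last_modified, jdata "
            "from cached_dear_dearcache "
            "where object_type = 'orders' "
            "and jdata->>'Status' in ('ESTIMATING', 'ESTIMATED') "
            "order by last_modified"
        )
        return cur.fetchall()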
Your approach is to convert the JSON data to a traditional SQL model. That will work too, but it's not very flexible: if the JSON "schema" changes, your database schema may need to change with it. Philosophically, I think it is better to go with the flow and use the JSON extensions; that's the best of both worlds. Performance is good, by the way.
Upvotes: 1
Reputation: 834
Add this to scraper.py:

from path.to.models import Product  # path.to is a placeholder for your app

product = Product()
product.<key> = <value>  # where <key> is the model field and <value> is the value you need to fill

and after you assign every field, add:

product.save()
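One assumption worth spelling out: if scraper.py is a standalone script rather than a management command, Django must be configured before the model import will work. A minimal sketch, with a placeholder settings module:

import os
import django

# "myproject.settings" is a placeholder; use your project's settings module.
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")
django.setup()  # must run before importing any models

from path.to.models import Product  # import only after setup()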
Trick
If all the keys in the json response match the fields in the model, you can do:
for k, v in response.items():
    setattr(product, k, v)
product.save()
That will save you a lot of lines and time :)
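A caveat for the data in the question: the scraped keys don't all match the model (title vs name, and price arrives as a string), and url is unique, so a plain save() on every run would eventually raise an IntegrityError. A sketch of an explicit mapping, assuming data is the dict the scraper prints and that the slug can be derived from the title (both assumptions):

from django.utils.text import slugify

for advertiser, items in data.items():
    for item in items:
        Product.objects.update_or_create(
            # url is unique, so it serves as the lookup key; building it from
            # the title is an assumption, substitute your real slug source.
            # SlugField defaults to max_length=50, hence the truncation.
            url=slugify(item["title"], allow_unicode=True)[:50],
            defaults={
                "name": item["title"],
                "price": float(item["price"]),  # scraper emits price as a string
                "advertiser": item["advertiser"],
                "advertiser_url": item["advertiser_url"],
                "main_image": item["main_image"],
                "description": item["description"],
            },
        )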
Upvotes: 4