Reputation: 11
My idea for painstakingly reading the Firefox cache to find image URLs whose host matches a regex pattern is to read the content of every cache file under ~/.cache/mozilla/firefox/[]/cache2 and keep only the parts that contain an image URL.
Here's an example of what a cache file looks like:
00000380: 0000 513a 6874 7470 3a2f 2f77 7777 2e74  ..Q:http://www.t
00000390: 6563 686e 6970 6167 6573 2e63 6f6d 2f77  echnipages.com/w
000003a0: 702d 636f 6e74 656e 742f 706c 7567 696e  p-content/plugin
000003b0: 732f 7961 7369 702f 696d 6167 6573 2f64  s/yasip/images/d
000003c0: 6566 6175 6c74 2f72 7373 5f33 3278 3332  efault/rss_32x32
000003d0: 2e70 6e67 006e 6563 6b6f 3a63 6c61 7373  .png.necko:class
000003e0: 6966 6965 6400 3100 7265 7175 6573 742d
Because these cache files seem to be binary, I would then set a pointer to the 'h' of "http" and read ahead until the next byte is 0x00, which on the ASCII table is the NUL character '\0'.
To prevent duplicates I'd write these URLs to a file, and every time a new URL is found I'd first check all existing entries in the file to see whether the URL is already there.
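A minimal sketch of the scan described above, assuming a cache entry has already been read into a bytes object. Note that an in-memory set handles the deduplication more simply than re-reading a results file for every hit:

```python
def extract_urls(data: bytes) -> set:
    """Collect NUL-terminated strings that start with b'http'."""
    urls = set()
    start = data.find(b"http")
    while start != -1:
        # Read ahead to the '\0' terminator.
        end = data.find(b"\x00", start)
        if end == -1:
            break
        urls.add(data[start:end])
        start = data.find(b"http", end)
    return urls

# Hypothetical sample data: two copies of the same URL collapse via the set.
sample = b"junk\x00http://example.org/a.png\x00more\x00http://example.org/a.png\x00"
print(extract_urls(sample))
```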
Is this the easiest possible approach, or am I missing something? I don't want to use other apps/extensions for this task.
Thanks
Upvotes: 1
Views: 1610
Reputation: 46779
The following should do roughly what you are trying to achieve:
import glob, os, re

# Replace [] with your actual profile folder name; glob does not
# expand ~ on its own, so expanduser() is needed.
cache_folder = os.path.expanduser("~/.cache/mozilla/firefox/[]/cache2/*")

urls = set()
for cache_filename in glob.glob(cache_folder):
    with open(cache_filename, 'rb') as file_cache:
        data = file_cache.read()
    # The files are binary, so match with a bytes pattern.
    urls |= set(re.findall(rb"(http.*?)\x00", data))

for url in urls:
    print(url)
This reads each file found in your cache folder and extracts every NUL-terminated string that starts with "http". Storing all of the matches in a set avoids any duplication; the set entries are then displayed.
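Since the question asks for URLs whose host matches a regex pattern, the resulting set can be filtered further. A hedged sketch, where the host pattern, the image suffix list, and the sample data are all just illustrative assumptions:

```python
import re
from urllib.parse import urlparse

# Stand-in data for URLs extracted from the cache files.
urls = {
    b"http://www.technipages.com/wp-content/plugins/yasip/images/default/rss_32x32.png",
    b"http://example.org/logo.svg",
}

host_pattern = re.compile(r"technipages\.com$")     # hypothetical host filter
image_suffixes = (".png", ".jpg", ".jpeg", ".gif")  # hypothetical image extensions

matches = []
for url in sorted(urls):
    text = url.decode("ascii", "replace")
    parsed = urlparse(text)
    # Keep only URLs whose host matches the pattern and whose path
    # looks like an image file.
    if host_pattern.search(parsed.netloc) and parsed.path.endswith(image_suffixes):
        matches.append(text)

print(matches)
```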
You could also consider researching the format of these files.
Upvotes: 1