比尔盖子

Reputation: 3637

Working with the Netscape Bookmark File Format in Python?

My Chrome bookmarks are too messy, so I exported them and decided to write a Python program to clean them up, for example by sorting them by keyword.

I found Beautiful Soup, but the problem is that the export file uses the Netscape Bookmark File Format, not standard XML. Beautiful Soup tries to convert it to standard XHTML, and Chrome is then unable to read the result.

Is there another solution?

Upvotes: 2

Views: 3660

Answers (3)

Aronanda

Reputation: 301

I figured out how to do this with Node.js. Just install cheerio (npm install -S cheerio) and add the names of the inputFile and outputFile, either through environment variables or command line arguments. Here's my solution:

const fs = require('fs')
const path = require('path')
const cheerio = require('cheerio')
const inputFile = process.env.INPUT || process.argv[2] || 'bookmarks.html'
const outputFile = process.env.OUTPUT || process.argv[3] || 'bookmarks.json'
const inputFilePath = path.resolve(inputFile)
const outputFilePath = path.resolve(outputFile)

fs.readFile(inputFilePath, { encoding: 'utf8' }, (error, data) => {
  if (error)
    return console.error(error)
  const $ = cheerio.load(data)
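  // Turn one <dt>, <h3>, or <a> element into an entry appended to `out`.
  function parseTerm(element, out) {
    // An empty <dt> has no child to descend into, so there is nothing to record.
    if (!element)
      return
    const item = {}
    if (element.name === 'dt') {
      // A <dt> wraps either a folder heading (<h3>) or a link (<a>).
      parseTerm($(element).children(':not(p)').first().get()[0], out)
    } else if (element.name === 'h3') {
      item.title = $(element).text()
      item.type = 'folder'
      item.updated = $(element).attr('last_modified')
      item.children = []
      out.push(item)
      // The next sibling of a folder heading is expected to be the <dl> holding its contents.
      parseList($(element).next(), item.children)
    } else if (element.name === 'a') {
      item.title = $(element).text()
      item.type = 'link'
      item.added = $(element).attr('add_date')
      item.href = $(element).attr('href')
      item.icon = $(element).attr('icon')
      out.push(item)
    }
  }
  // Walk every non-<p> child of a <dl> and collect the resulting entries.
  function parseList(list, out) {
    list.children(':not(p)').each(function () {
      parseTerm(this, out)
    })
  }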
  const out = []
  parseList($('dl').first(), out)
  fs.writeFile(outputFilePath, JSON.stringify(out, null, 2), error => {
    if (error)
      return console.error(error)
    console.log('Success!')
  })
})
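
Since the question asks about Python, here is a rough sketch of the same traversal using only the standard library's html.parser, which reports tags as it sees them and does not rewrite the file. The class name BookmarkParser, the helper parse_bookmarks, and the dict keys (copied from the Node.js output above) are my own choices for illustration, and the sketch assumes every folder heading is followed by its own <DL> block, as in a typical Chrome export.

```python
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    """Collect folders and links from a Netscape bookmark export (illustrative sketch)."""

    def __init__(self):
        super().__init__()
        self.root = []              # top-level entries
        self._stack = [self.root]   # child lists of the currently open <DL> blocks
        self._pending = None        # folder whose <DL> has not been opened yet
        self._current = None        # entry whose title text is being collected

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "h3":
            self._current = {
                "title": "",
                "type": "folder",
                "updated": attrs.get("last_modified"),
                "children": [],
            }
            self._stack[-1].append(self._current)
            self._pending = self._current
        elif tag == "a":
            self._current = {
                "title": "",
                "type": "link",
                "added": attrs.get("add_date"),
                "href": attrs.get("href"),
                "icon": attrs.get("icon"),
            }
            self._stack[-1].append(self._current)
        elif tag == "dl":
            if self._pending is not None:
                # This <DL> holds the children of the folder heading just seen.
                self._stack.append(self._pending["children"])
                self._pending = None
            else:
                # Outermost <DL>: push the same list so its </DL> pops cleanly.
                self._stack.append(self._stack[-1])

    def handle_endtag(self, tag):
        if tag in ("h3", "a"):
            self._current = None
        elif tag == "dl" and len(self._stack) > 1:
            self._stack.pop()

    def handle_data(self, data):
        # Text inside <H3> or <A> is the title; everything else is ignored.
        if self._current is not None:
            self._current["title"] += data


def parse_bookmarks(text):
    """Return the nested list of folders and links found in `text`."""
    parser = BookmarkParser()
    parser.feed(text)
    parser.close()
    return parser.root
```

Calling parse_bookmarks on the contents of the exported file should give nested lists of dicts that can be sorted or filtered before writing anything back out.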

Upvotes: 0

Allen Galler

Reputation: 55

I have the same problem. Right now I am writing a Python bookmark toolkit just for cleaning up my messy Chrome bookmarks.

Bookmarkit on GitHub: https://github.com/allengaller/bookmarkit

I don't think locating Chrome's own bookmark file helps you or me, unless you parse the JSON file into a dict. (I see you opened another question about this, so I assume you have already left the SGML bookmark file alone.)

My solution will be:

  1. Managing bookmarks from the CLI is a dead end, because it is a very hard process for people who need a tool JUST for managing bookmarks (most of them have 10M+ bookmark files, like me). I will use PyGTK or PyQt to provide an easy drag-and-drop GUI.

  2. About BS changing your file: forget about the changes BS makes to your bookmark file. Every time you finish parsing the file, generate a new NETSCAPE-BOOKMARK file instead of reusing the original one (even if it hasn't been changed).

  3. Try the ElementTree library.
    See here: http://docs.python.org/library/xml.etree.elementtree.html I think parsing the SGML export is much safer than directly changing the JSON file that Chrome is using. Heavy users like me take our data very seriously, so I would rather export carefully, import into my toolkit, finish my job, then import back into Chrome. This process had better be explicit.
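
To make point 2 concrete, here is a minimal sketch of generating a fresh NETSCAPE-BOOKMARK file from an in-memory tree rather than patching the original. The dict shape (type, title, href, children) and the names render_items and render_bookmarks are assumptions made for this example, and the header lines imitate what a Chrome export looks like as far as I know, so compare them with your own export before relying on them. Entries are sorted by title at each level, since that was the goal in the question.

```python
from html import escape

# Header imitating a browser export; check it against a real exported file.
HEADER = (
    "<!DOCTYPE NETSCAPE-Bookmark-file-1>\n"
    '<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">\n'
    "<TITLE>Bookmarks</TITLE>\n"
    "<H1>Bookmarks</H1>\n"
)


def render_items(items, depth=1):
    """Return the lines for one folder level, sorted by title (case-insensitive)."""
    pad = "    " * depth
    lines = []
    for item in sorted(items, key=lambda entry: entry["title"].lower()):
        title = escape(item["title"])
        if item["type"] == "folder":
            lines.append(f"{pad}<DT><H3>{title}</H3>")
            lines.append(f"{pad}<DL><p>")
            lines.extend(render_items(item["children"], depth + 1))
            lines.append(f"{pad}</DL><p>")
        else:
            # escape() also escapes quotes, so the URL is safe inside HREF="...".
            href = escape(item["href"])
            lines.append(f'{pad}<DT><A HREF="{href}">{title}</A>')
    return lines


def render_bookmarks(items):
    """Build the whole bookmark file as one string."""
    body = ["<DL><p>"] + render_items(items) + ["</DL><p>"]
    return HEADER + "\n".join(body) + "\n"
```

Writing the returned string to a new .html file (with UTF-8 encoding) keeps the original export untouched, so you can always fall back to it if the import goes wrong.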

Upvotes: 2

John McCollum

Reputation: 5142

By default, Chrome stores your bookmarks as JSON, for example at:

C:\Users\user\AppData\Local\Google\Chrome\User Data\Default\Bookmarks

For Linux users:

~/.config/google-chrome/Default/Bookmarks

(The location of this file will vary depending on your platform of course.)

You might find this file easier to manipulate than an HTML export.
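
As a rough illustration of why the JSON file can be easier to handle, the sketch below lists every link with its folder path using only the json module. The key names it relies on ("roots", "children", "type", "name", "url") reflect how the Bookmarks file is laid out as far as I know, and the function names iter_links and load_links are made up for this example, so check them against a copy of your own file. It only reads the data; it is probably wise to work on a copy rather than the live file.

```python
import json


def iter_links(node, folders=()):
    """Yield (folder path, name, url) for every link below `node`."""
    if node.get("type") == "url":
        yield "/".join(folders), node.get("name", ""), node.get("url", "")
    # Folders carry a "children" list; links simply have none.
    for child in node.get("children", []):
        yield from iter_links(child, folders + (node.get("name", ""),))


def load_links(text):
    """Parse the JSON text of a Bookmarks file and return its links sorted by name."""
    data = json.loads(text)
    links = []
    for root in data.get("roots", {}).values():
        # Skip any non-folder values that may sit next to the root folders.
        if isinstance(root, dict):
            links.extend(iter_links(root))
    return sorted(links, key=lambda row: row[1].lower())
```

For example, reading the copied file with open(path, encoding="utf-8") and passing its contents to load_links gives a flat, sorted list that is easy to inspect for duplicates or dead folders.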

Upvotes: 7
