tim-hy

Reputation: 53

Word count of Markdown cells in Jupyter Notebook

Is there a way to perform a word count of only markdown cells in Jupyter Notebook, and if possible within the notebook itself? Thanks

Edit: It appears that doing it within the notebook is rather complicated, so I'm happy with just an external solution.

Upvotes: 2

Views: 4464

Answers (1)

David Scholz

Reputation: 9856

A Jupyter notebook (.ipynb file) is just a JSON file. We can parse it with Python, keep only the cells with 'cell_type': 'markdown', and reduce their source content to a word count.

For parsing a JSON file we can just use the built-in json encoder/decoder module as follows.

import json

with open('test.ipynb') as json_file:
    data = json.load(json_file)

print(data)

Here test.ipynb is a Jupyter notebook with two code cells and two markdown cells. The content of data, formatted here as JSON for readability, is as follows.

{
   "cells":[
      {
         "cell_type":"markdown",
         "metadata":{
            
         },
         "source":[
            "# This is a markdown file\n",
            "Hello World"
         ]
      },
      {
         "cell_type":"code",
         "execution_count":2,
         "metadata":{
            
         },
         "outputs":[
            {
               "name":"stdout",
               "output_type":"stream",
               "text":[
                  "Hello World\n"
               ]
            }
         ],
         "source":[
            "print(\"Hello World\")"
         ]
      },
      {
         "cell_type":"code",
         "execution_count":3,
         "metadata":{
            
         },
         "outputs":[
            {
               "name":"stdout",
               "output_type":"stream",
               "text":[
                  "Hello World 2\n"
               ]
            }
         ],
         "source":[
            "print(\"Hello World 2\")"
         ]
      },
      {
         "cell_type":"markdown",
         "metadata":{
            
         },
         "source":[
            "## More markdown\n",
            "hello"
         ]
      }
   ],
   "metadata":{
      "interpreter":{
         "hash":"e7370f93d1d0cde622a1f8e1c04877d8463912d04d973331ad4851f04de6915a"
      },
      "kernelspec":{
         "display_name":"Python 3.10.2 64-bit",
         "language":"python",
         "name":"python3"
      },
      "language_info":{
         "codemirror_mode":{
            "name":"ipython",
            "version":3
         },
         "file_extension":".py",
         "mimetype":"text/x-python",
         "name":"python",
         "nbconvert_exporter":"python",
         "pygments_lexer":"ipython3",
         "version":"3.10.2"
      },
      "orig_nbformat":4
   },
   "nbformat":4,
   "nbformat_minor":2
}

A possible loop that sums up the words in the source entries of all markdown cells could look as follows.

word_count = 0
for cell in data['cells']:
    # Only markdown cells contribute to the count; code cells are skipped.
    if cell['cell_type'] == "markdown":
        for line in cell['source']:
            # Drop tokens containing "#" (heading markers); other markdown
            # syntax such as "*" or "-" may need filtering here as well.
            words = [word for word in line.split() if "#" not in word]
            word_count += len(words)

print(word_count)
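
The loop above can also be wrapped into a reusable function. The following is only a minimal sketch under a few assumptions, not a definitive implementation: the name count_markdown_words is hypothetical, the notebook is assumed to be in nbformat 4 with a top-level 'cells' list, and a "word" is taken to be any whitespace-separated token that contains at least one letter or digit. That filter is a swapped-in heuristic: it drops pure markup tokens such as "#", "-" or "---", but unlike the loop above it keeps words like "C#". It also accepts 'source' as either a single string or a list of lines, since the notebook format permits both.

```python
import json
import re


def count_markdown_words(notebook_path):
    """Count words in the markdown cells of a notebook file (hypothetical helper)."""
    with open(notebook_path, encoding="utf-8") as notebook_file:
        notebook = json.load(notebook_file)

    total = 0
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue  # skip code and raw cells
        source = cell.get("source", "")
        # "source" may be one string or a list of lines; normalise to one string.
        text = source if isinstance(source, str) else "".join(source)
        # Assumption: a token counts as a word only if it contains at least
        # one letter or digit, so pure markup tokens ("#", "-", "---") are dropped.
        total += sum(1 for token in text.split() if re.search(r"[^\W_]", token))
    return total
```

Calling count_markdown_words('test.ipynb') on the notebook shown above gives 10, the same result as the loop.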

Upvotes: 2
