Reputation: 53
Is there a way to get a word count of only the markdown cells in a Jupyter Notebook, if possible within the notebook itself? Thanks
Edit: It appears that doing this within the notebook is rather complicated, so I'm happy with an external solution.
Upvotes: 2
Views: 4464
Reputation: 9856
A Jupyter notebook is just a JSON file (an .ipynb file). We can parse this JSON with Python, filter for cells with 'cell_type': 'markdown', and reduce their source content to a word count.
To parse a JSON file, we can use the built-in json encoder/decoder library as follows.
import json

# Notebooks are stored as UTF-8 JSON, so the encoding is set explicitly
with open('test.ipynb', encoding='utf-8') as json_file:
    data = json.load(json_file)
print(data)
Here test.ipynb is a Jupyter notebook with two code cells and two markdown cells. The content of data (shown here as JSON) is as follows.
{
"cells":[
{
"cell_type":"markdown",
"metadata":{
},
"source":[
"# This is a markdown file\n",
"Hello World"
]
},
{
"cell_type":"code",
"execution_count":2,
"metadata":{
},
"outputs":[
{
"name":"stdout",
"output_type":"stream",
"text":[
"Hello World\n"
]
}
],
"source":[
"print(\"Hello World\")"
]
},
{
"cell_type":"code",
"execution_count":3,
"metadata":{
},
"outputs":[
{
"name":"stdout",
"output_type":"stream",
"text":[
"Hello World 2\n"
]
}
],
"source":[
"print(\"Hello World 2\")"
]
},
{
"cell_type":"markdown",
"metadata":{
},
"source":[
"## More markdown\n",
"hello"
]
}
],
"metadata":{
"interpreter":{
"hash":"e7370f93d1d0cde622a1f8e1c04877d8463912d04d973331ad4851f04de6915a"
},
"kernelspec":{
"display_name":"Python 3.10.2 64-bit",
"language":"python",
"name":"python3"
},
"language_info":{
"codemirror_mode":{
"name":"ipython",
"version":3
},
"file_extension":".py",
"mimetype":"text/x-python",
"name":"python",
"nbconvert_exporter":"python",
"pygments_lexer":"ipython3",
"version":"3.10.2"
},
"orig_nbformat":4
},
"nbformat":4,
"nbformat_minor":2
}
A possible loop that retrieves all strings from the source entries of markdown cells and counts their words could look as follows.
wordCount = 0
for each in data['cells']:
    cellType = each['cell_type']
    if cellType == "markdown":
        content = each['source']
        for line in content:
            # Drop tokens containing "#" (heading markers); we might need to filter for more markdown keywords here
            temp = [word for word in line.split() if "#" not in word]
            wordCount = wordCount + len(temp)
print(wordCount)
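The loop above can also be wrapped in a reusable function. The following is only a sketch under a few assumptions: the names count_markdown_words and count_markdown_words_in_file are made up for illustration, a "word" is any whitespace-separated token containing at least one word character, and source may be either a list of strings or a single string (the notebook format allows both). Unlike the loop above, it keeps tokens such as "C#" and only drops pure-symbol tokens such as "#", "##" or "---".

```python
import json
import re


def count_markdown_words(notebook: dict) -> int:
    """Count whitespace-separated words in all markdown cells of a parsed notebook."""
    total = 0
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", "")
        # "source" may be a single string or a list of strings
        text = source if isinstance(source, str) else "".join(source)
        for word in text.split():
            # Skip tokens without any word character (letter, digit, underscore),
            # e.g. heading markers "#", list bullets "-", or rules "---"
            if re.search(r"\w", word):
                total += 1
    return total


def count_markdown_words_in_file(path: str) -> int:
    """Hypothetical helper: load an .ipynb file and count its markdown words."""
    with open(path, encoding="utf-8") as notebook_file:
        return count_markdown_words(json.load(notebook_file))


# In-memory example mirroring the cells of test.ipynb shown above
example = {
    "cells": [
        {"cell_type": "markdown", "source": ["# This is a markdown file\n", "Hello World"]},
        {"cell_type": "code", "source": ["print(\"Hello World\")"]},
        {"cell_type": "markdown", "source": ["## More markdown\n", "hello"]},
    ]
}
print(count_markdown_words(example))  # prints 10
```

For a notebook on disk, count_markdown_words_in_file('test.ipynb') would give the same count. Markdown syntax such as links, inline code or HTML tags is still counted as words here, so the result is an approximation.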
Upvotes: 2