Reputation: 49
I want to calculate word statistics for pages in my knowledge base, which runs on Confluence.
But before I do the calculations, I'd like to retrieve the page data: the text written on the pages.
I have a Python script that was originally written to collect comments from pages, and I am trying to adapt it to the /rest/api/content/{id} REST API, which I found with the Confluence REST browser.
The original script uses an API whose JSON response, when parsed with the json() method, yields dictionary objects.
However, the /rest/api/content/{id} API returns a result that does not seem to contain a well-formed dictionary. I receive string objects, and I cannot simply index into them as array['index'] = result['value'] to retrieve the page data.
I am using JupyterLab environment to run the code.
When using the Confluence REST Browser and the /rest/api/content/{id} API for the page 4068365, Confluence returns the following result:
{
"id": "4068365",
"type": "page",
"status": "current",
"title": "Page title",
"body": {
"view": {
"value": "<p>Some text</p>",
"representation": "storage",
"_expandable": {
"webresource": "",
"content": "/rest/api/content/4068365"
}
},
"_expandable": {
"editor": "",
"export_view": "",
"styled_view": "",
"storage": "",
"anonymous_export_view": ""
}
},
"extensions": {
"position": "none"
},
...
I'd like to obtain the value of the 'value' key. However, 'value' is not recognized as a key, because the result appears to be formatted as a string and not as a dictionary.
Here's the code I have.
import requests
import json
import getpass
import re
import html
import pandas as pd
from datetime import datetime
# Suppress warnings for HTTPS connections with a self-signed cert (verification is disabled below)
requests.packages.urllib3.disable_warnings()
# Create login session for Confluence
auth = ('mylogin', getpass.getpass())
s = requests.Session()
s.auth = auth
s.verify = False
s.headers = {"Content-Type": "application/json"}
# Confluence REST API URI
WIKI = 'https://example.net/wiki/rest/api/'
# Obtain text from Confluence HTML layout
def cleanhtml(raw_html):
    cleanr = re.compile('<.*?>')
    text = html.unescape(raw_html)
    text = re.sub(cleanr, '', text)
    text = text.replace(u'\xa0', u' ')
    return text
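For reference, the two cleanup steps behave like this on a small sample (a sketch with a hypothetical input string):

```python
import re
import html

raw = '<p>Some&nbsp;sample text</p>'
text = html.unescape(raw)         # '&nbsp;' becomes the non-breaking space '\xa0'
text = re.sub('<.*?>', '', text)  # strip HTML tags
text = text.replace('\xa0', ' ')  # normalize non-breaking spaces
print(text)  # Some sample text
```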
# Retrieving page data
def get_data(page_id):
    data = []
    r = s.get(
        '{}content/{}'.format(WIKI, page_id),
        params=dict(
            expand='body.view'
        )
    )
    for content in r.json():
        pgdata = dict()
        # I can't address value as content['value']
        pgdata['text'] = cleanhtml(content['body']['view'].get('value'))
        data.append(pgdata)
    return data
# Pages to extract from
with open(r'C:\\Users\\Stacy\\Documents\\pages.txt') as pagesf:
    pagesl = pagesf.read()
pages = pagesl.split(",\n")
print(pages)
# Preparing data frame and exporting to Excel
textdata = list()
for page in pages:
    print('Handling:', page)
    textdata.extend(get_data(page))
df = pd.DataFrame(
    textdata,
    columns=['text']
)
df.to_excel('page_data{}.xlsx'.format(datetime.now().strftime("%Y_%m_%d_%H-%M")))
I want to collect the text from
"value": "<p>Some text</p>",
into data and store it all in a dictionary. However, I see that content holds plain strings rather than the data itself, so I can't reference 'body' as a key, because it's not a key.
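A minimal illustration with made-up data of what I observe: iterating over a Python dict yields its keys, which are strings, so the loop produces str objects rather than the nested values:

```python
result = {
    "id": "4068365",
    "type": "page",
    "body": {"view": {"value": "<p>Some text</p>"}},
}
for content in result:
    # each item is a top-level key string: 'id', 'type', 'body'
    print(type(content).__name__, content)
```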
Please help me retrieve page data from 'value'. What would be the right way? Thank you.
Upvotes: 0
Views: 9530
Reputation: 49
Here's the solution that I have come to:
def get_words(page_id):
    comments = []
    r = s.get(
        '{}content/{}'.format(WIKI, page_id),
        params=dict(
            expand='body.view'
        )
    )
    for cmnt in r:  # No valid json, so we scan the result
        comments.append(cmnt)  # Collect all byte chunks into a list
    chunks = []  # Results are encoded; store decoded data in a list (avoid shadowing the built-in 'bytes')
    for chunk in comments:
        decoded = chunk.decode('utf-8', 'ignore')  # Decode as UTF-8 and ignore errors
        chunks.append(decoded)
    chunksstr = "".join(chunks)  # The list contains split strings, so join them into a single line
    parsed = json.loads(chunksstr)  # Convert the line into a valid JSON object
    pgdata = dict()  # Prepare a dictionary to store the extracted text
    pgdata['value'] = parsed['body']['view'].get('value')  # Retrieve the text from the page
    pgdatac = cleanhtml(pgdata['value'])  # Remove HTML tags
    counts = len(re.findall(r'\w+', pgdatac))  # Extra line to count the words on the page
    print(counts)
Upvotes: 0