Thursday, September 10, 2020

Convert Confluence Documents to AEM/XML

Atlassian's Confluence is a great tool for managing and sharing project documentation with your team, and its seamless integration with Jira can help put everything in one place. We've used Confluence and Jira for almost all documentation, from general project docs to How-Tos and even New Hire Onboarding.

Recently I started going through the process of rewriting our training material for DevOps Engineers and found quite a lot of duplication in the documentation. For topics such as git and Adobe Experience Manager, there were documents created for front-end and back-end developers as well as for Project Managers, QA Engineers, and others.

As I discovered all this duplication and extra work, I thought there had to be a better way; we had too many people creating and maintaining the same thing over and over with only slight variations. My first thought was to move common documents to a shared location and assign owners. This would enable different groups to cross-reference material already documented for their respective teams. The dev team would own and be responsible for creating and maintaining git and code-related topics, while the devops team would be responsible for topics on AEM setup, configuration, troubleshooting, etc.

While moving everything around in Confluence and assigning owners may help reduce the number of people updating documents, it doesn't solve the issue of ensuring the documentation fits the audience. Project Managers probably care very little about how to do a git merge and squash those commits, but they may need to know general info, such as what git is and how it can be used on a project, without getting too far into the technical weeds.

While searching for a better way, I stumbled across Dita (Darwin Information Typing Architecture). After reading up on it, I realized this was what I was looking for: a way to create documentation once, in a standardized format, and have it rendered to fit the needs of the audience.

Much to my dismay, Confluence doesn't have a Dita plugin or support it directly, which means I would either need to recreate all our documentation in the Dita format or find a way to easily convert it.

Having worked on numerous AEM projects as a Full-stack developer, DevOps Engineer, and AEM Architect, I remembered there is an XML Documentation feature for AEM that should do what I want. But first, I need to export my content from Confluence.


Exporting Confluence Documents

Confluence makes it very easy to export a document or an entire space in multiple formats, such as pdf, MS Word, and HTML.

In this example, we are going to export the entire space, which will give us all parent pages, child pages, assets, styles, etc.

To export a site in Confluence, go to: Space Settings -> Content Tools -> Export.



NOTE: While there is an export-to-xml option, it won't meet our needs. The export-to-html option, however, is perfect for what we want, as the XML Documentation feature also provides workflows to convert html documents into Dita topics.

Select 'HTML', and click 'Next'. 

Confluence will gather up all our documents and assets, convert the docs to html, and package them in a zip file for us to download.


After downloading and unpacking the zip archive and examining the content, we can see that each page is represented as html and contains references to the other html pages in our space, but it also contains ids, attributes, and even some elements that are very Confluence-specific.
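A trimmed, made-up fragment gives a sense of the kind of markup the export produces (the class names, ids, and content below are only representative; your pages will differ):

<div id="main-content" class="wiki-content group">
    <section class="conf-macro output-block" data-macro-name="expand">
        <p class="auto-cursor-target"><br/></p>
        <p>Run <code>git status</code> before committing.</p>
    </section>
</div>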


Since we will be using AEM to render the documents, we don't need a lot of the class names, ids, and other bits Confluence added for us. It is also important to note that our documents need to be in xhtml format before AEM will convert them to Dita.

If we uploaded this document as-is, we could expect nothing to happen; it would not be processed by the workflow. If we simply added the xml header to identify it as an xhtml document, the workflow would attempt to process it but would fail with many errors. So we need a way to pre-process the documents and clean them up.



Cleaning Up With Tidy

If you are not familiar with HTML Tidy, it is a great command-line utility that can help clean up and correct most errors in html and xml documents. While we are not expecting any "bad" html, we will probably have some empty div elements and Confluence-specific items, and since we are processing hundreds of documents we want to ensure they meet the xhtml standard and are as clean as possible without having to go through each one individually to correct errors.


Create a Tidy Config

A Tidy config will help ensure all documents are pre-processed the same way, so that we have nice, uniform output. Using your favorite text editor, create a config.txt file to hold the configuration below.

clean: true
indent: auto
indent-spaces: 4
output-xhtml: true
add-xml-decl: true
add-xml-space: true
drop-empty-paras: true
drop-proprietary-attributes: true
bare: true
word-2000: true
new-blocklevel-tags: section
write-back: true
tidy-mark: false
merge-divs: true
merge-spans: true
enclose-text: true

To read more about what each of these settings does and other options available, check out the API doc page.

Most of the options used above should be self-explanatory, but there are a few that need to be called out.

  • output-xhtml - Tells Tidy we want the output in xhtml, the format we need for AEM to process.
  • add-xml-decl - Adds the xml declaration to our output document.
  • new-blocklevel-tags - Confluence adds a 'section' element to all our pages. This element does not conform to xhtml, and Tidy will throw an error and refuse to process those docs unless we tell it that 'section' is an acceptable element. NOTE: This is a comma-separated list of elements, so if you have others, feel free to add them here.
  • write-back - Writes the results back to the original file. By default Tidy outputs to stdout. We could create a script that writes new files and leaves the originals alone, but since we still have all the originals in the zip file, we will overwrite the ones here.
  • tidy-mark - By default, Tidy adds metadata to our document indicating that it processed the output. Since we want our output to be as clean as possible for the next step, we don't want this extra info.
NOTE: I'm using the drop-empty-paras, merge-divs, and merge-spans settings to account for any occurrences where the original author unknowingly created extra elements, which is very common when using wysiwyg (what you see is what you get) editors. Authors will sometimes hit the Enter key a few times to create spacing, not realizing that behind the scenes they are adding extra empty <p> elements.
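For example, a page where an author pressed Enter a few times between two steps might export with something like the following (illustrative), which drop-empty-paras cleans up for us:

<p>Install the package.</p>
<p></p>
<p></p>
<p>Restart the service.</p>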


Processing with Tidy

After we have created our configuration file, we are ready to begin processing the files. We tell tidy to use the configuration file we just created and to process all *.html files in the directory we unzipped our documents into.

$ tidy -config ~/projects/aem-xml/tidy/config.txt *.html

Depending on how many documents you have and their complexity, tidy should complete its task in anywhere from a few seconds to a minute or two. If we reopen a document after tidy has processed it, we should now see proper xhtml.


Our document has been reformatted, the xml declaration and namespace have been added, and any issues with our html have been resolved for us as well.
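The top of each processed file should now start along these lines (the title is a placeholder, and the exact doctype Tidy emits depends on your content):

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>Git Basics</title>
    ...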


Looking further down the document, you can see the html <section> tag(s) are still contained in the output, as well as other class names and ids.



You will also notice that the images contained in the attachments folder are referenced with markup like the following:
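For example (the numeric ids and class name here are just placeholders):

<p>
    <img class="confluence-embedded-image" src="attachments/123456789/987654321.png" />
</p>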


Once our documents are imported and processed in AEM, our images will need to be uploaded to the DAM, which will either change their path or add "/content/dam/" to the path. If we forget this step, good luck trying to re-associate the images with their original docs.

If we attempted to import our documents at this point, the workflows would process them but would not create proper Dita topics from them, which would require even more manual work for each document.

The XML Documentation feature for AEM allows us to apply custom XSLT when processing our documents so that they end up as Dita topics and are recognized in AEM as such.


Applying Custom XSLT in AEM

For this next step, we will need access to an AEM Author instance with the XML Documentation feature installed.

After examining our documents we know there are a few tasks we need to perform to clean them up a bit further.

  • Remove empty elements
  • Remove all class names and ids
  • Update our image paths
First, we want to upload all the assets in the attachments folder to the DAM; we will put these in "/content/dam/attachments/...". Take a quick peek in the exported images directory to determine whether there is anything we need, and upload as appropriate. If not, you may also need to update or remove those elements in our documents when we import them.

Open crxde, http://<host:port>/crx/de/index.jsp, and log in as an administrator. We will need to create an overlay so that we can specify our input and output folders for the workflow.

Copy the /libs/fmdita/config/h2d_io.xml file to /apps/fmdita/config/h2d_io.xml, and update the inputDir and outputDir elements to the path(s) you will be using.
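The only parts of the file we need to touch are those two elements. As a fragment, with placeholder paths you should swap for the folders you plan to use:

<!-- fragment of /apps/fmdita/config/h2d_io.xml; everything else stays as copied from /libs -->
<inputDir>/content/dam/confluence-import/input</inputDir>
<outputDir>/content/dam/confluence-import/output</outputDir>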



You will notice there is already a subdirectory, html2dita, containing an h2d_extended.xsl file. When html documents are uploaded to our input folder, this file is included by default in that processing, in addition to the standard transformations.

Out of the box, the /apps/fmdita/config/html2dita/h2d_extended.xsl file contains just the xsl declaration and nothing else. We will add our transformations to this file so that everything uploaded is processed the same way.

We will create an identity template to do the majority of work for us. While this is very general and applies to all elements, you should definitely examine your own data first to determine how best to process it.
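A minimal sketch of such an identity template, added inside the existing xsl:stylesheet element of h2d_extended.xsl, might look like the following. The attribute and element matches are only examples of the cleanup described above (strip class/id, drop empty wrappers); depending on how the workflow hands documents to the XSLT, you may also need to declare the XHTML namespace and prefix these matches accordingly.

<!-- copy everything through unchanged by default -->
<xsl:template match="@*|node()">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>

<!-- drop the class names and ids Confluence added -->
<xsl:template match="@class|@id"/>

<!-- drop empty div and span wrappers left over from the export -->
<xsl:template match="div[not(normalize-space()) and not(*)]"/>
<xsl:template match="span[not(normalize-space()) and not(*)]"/>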



Images are a little more work to get right, but not overly complicated. We want to ensure we are only modifying internal images and not ones that may be linked from other sites; in other words, the src attribute should start with 'attachments'. Also take note: our XML editor expects the element tag for images to be <image href='<path_to_file>'/> rather than the xhtml <img src='<path_to_file>'/> element.
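A sketch of that transform might look like this (again illustrative, assuming the attachments folder was uploaded to /content/dam/attachments as described above):

<!-- rewrite internal Confluence images to Dita image elements pointing at the DAM -->
<xsl:template match="img[starts-with(@src, 'attachments')]">
    <image>
        <xsl:attribute name="href">
            <xsl:value-of select="concat('/content/dam/', @src)"/>
        </xsl:attribute>
    </image>
</xsl:template>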



Once we have our transforms in place we are ready to upload our data. The XML Documentation feature comes with a few different workflows to process html documents once uploaded. 

   



We can either upload our pages individually, one by one, or repackage them into a zip file and upload that zip file to our input folder. Since we will be uploading a few hundred pages, the zip option is the one we want to go with.

If we merely wanted to test our process, we could pick one or two html files and upload just those. Watching the logs and checking the output folder will give us an indication of whether everything is working correctly or whether additional transforms or cleanup are required.

After uploading, we can switch to the XML Editor in AEM and see the new *.dita files in the output folder we previously defined. Each file is named 1:1 after its original filename, so if we uploaded a file 123.html to our input folder, there should now be a 123.dita file in our output folder.



If our cleanup and transforms worked properly, we now can double-click on any of these new *.dita files and see the results of our hard work.



Conclusion

Using a few widely available tools, we can successfully migrate documents from Confluence into AEM using the XML Documentation feature. Of course, this is merely one of many steps in performing a true migration and fully using Dita to our benefit. Once the documents are in the Dita format, a Content Author familiar with Dita should go through the documentation looking for areas of reuse, identifying audiences, creating maps, etc.

If you are serious about working with Dita, you should consider using a compliant editor such as Adobe FrameMaker. FrameMaker can be integrated with AEM to give your team a better experience creating Dita documents, collaborating on them, and publishing them in Adobe Experience Manager.

Tuesday, September 1, 2020

Checking Links with Scrapy

Whether you are supporting an AEM project or any other web project, one of the regular tasks someone on the team should perform is checking all the links on the site before go-live so the authoring team can correct any broken ones.

If we only have a few pages, this can be a pretty easy task to perform manually, but for large sites with hundreds or thousands of pages it can be a very tall task even for the whole team. One commonly used strategy is to take all known URLs and/or redirects, put them in a file, and write a quick curl script to iterate over them; but what about unknown links that may have been created, or links returning error codes?

Luckily, we can use the open-source Scrapy framework to create a custom crawler that checks all the links for us and outputs a report we can then give to the authoring team. Scrapy is a Python framework that provides both command-line utilities with prebuilt spiders and the ability to customize those spiders to our specific needs. While the framework was originally written to scrape data from websites, we can use it for our purposes as well.


Before we get started, you will need a system with Python 3 and pip installed to use as our development environment.

Install Scrapy using pip

$ pip install scrapy


Normally at this step you would create a folder for all your project files, which you can certainly do. For this project, we will use the scrapy command-line utility to create the initial project structure for us.

$ scrapy startproject linkCrawler

With the command above, scrapy creates the initial project 'linkCrawler' and all the necessary files to get started. If we take a look at the structure, it should look like the following:

$ tree .
.
├── linkCrawler
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

2 directories, 7 files




Next, we need to create our first spider. Again, we will use the scrapy utility to generate our spider for us and we will customize it to our needs later.

$ cd linkCrawler


$ scrapy genspider -t crawl example example.com
Created spider 'example' using template 'crawl' in module:
  linkCrawler.spiders.example


Let's break down the command above:

scrapy genspider - tells scrapy to generate a spider for us
-t crawl - tells scrapy to use the 'crawl' template when creating the spider
example - our spider's name.
example.com - The domain our spider will crawl


Let's take a look at our project structure. We can see scrapy added our spider example.py.

$ tree .
.
├── linkCrawler
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── example.py
└── scrapy.cfg

2 directories, 8 files


The two files we will concentrate on are 'settings.py' and 'example.py'.

Since we are crawling our own site and want to check ALL the links on it, we will tell our spider to disregard the rules in the robots.txt file. Normally we would want to obey the rules in robots.txt, especially if we do not own the site.

We can turn this feature off by opening the settings.py file, finding the following line, and changing the value to False.

# Obey robots.txt rules
ROBOTSTXT_OBEY = False


Save the file. 

Next we will modify our spider to check for bad links and create a report.

Opening the example.py file, we can see the basic structure has already been created for us.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        return item



First, let's add an Item class to hold our report data; note the extra import of Item and Field from scrapy.item.

from scrapy.item import Item, Field

class BadLinks(Item):
    referer = Field()
    url = Field()
    status = Field()
    dispatcher = Field()



So that our report is useful we want to capture a few data items:
  • Referer: What page were we on when the link was followed
  • URL: What is the link that was followed
  • Status: The HTTP status code returned when we crawled the link
  • Dispatcher: The value of the 'X-Dispatcher' header. This will tell us what dispatcher/publish pair had an issue so that we can investigate if there is a problem with that pair.

In our ExampleSpider class, we will add support for non-200 HTTP status codes.

class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']
    handle_httpstatus_list = [404, 410, 301, 500]


Here is an explanation of what is going on in these few lines:
  • name: This is the name of this spider and must be unique.
  • allowed_domains: List of the domains we want the spider to crawl. We could include any subdomains or other domains linked to this site that we own and want to check.
  • start_urls: List of urls where we want crawling to begin. Since we want the entire site, we will leave this at the root of the site.
  • handle_httpstatus_list: List of http status codes outside of the 200-300 range that we want this spider to handle.
Let's add a few rules telling our spider what it should crawl.

rules = [
    Rule(
        LinkExtractor(allow_domains=allowed_domains, deny=('/media/*'), unique=True),
        callback='parse_item',
        follow=True
    ),
    Rule(
        LinkExtractor(allow=(''), unique=True),
        callback='parse_item',
        follow=False
    )
]




In the above section of code we have two Rule objects using the LinkExtractor object. 

In the first rule, we specify allow_domains to match our allowed_domains variable, add a deny pattern for the /media/ section of the site, and indicate that we only want unique links.

The second Rule object allows all links to be identified but not followed. This covers external links in our output without crawling those sites.

Both rules use 'parse_item' as the callback to determine how each link is handled, so let's modify that method to record the data for bad links.

    def parse_item(self, response):
        report_if = [404, 500]
        if response.status in report_if:
            item = BadLinks()
            item['referer'] = response.request.headers.get('Referer', None)
            item['status'] = response.status
            item['url'] = response.url
            item['dispatcher'] = response.headers.get('X-Dispatcher', None)
            yield item
        yield None



In the section above, we check whether the response status code is a 404 or 500; if it is, we capture the values we want for our report.

Our complete spider should now look like the following:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field


class BadLinks(Item):
    referer = Field()
    url = Field()
    status = Field()
    dispatcher = Field()


class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']
    handle_httpstatus_list = [404, 410, 301, 500]

    rules = [
        Rule(
            LinkExtractor(allow_domains=allowed_domains, deny=('/media/*'), unique=True),
            callback='parse_item',
            follow=True
        ),
        Rule(
            LinkExtractor(allow=(''), unique=True),
            callback='parse_item',
            follow=False
        )
    ]

    def parse_item(self, response):
        report_if = [404, 500]
        if response.status in report_if:
            item = BadLinks()
            item['referer'] = response.request.headers.get('Referer', None)
            item['status'] = response.status
            item['url'] = response.url
            item['dispatcher'] = response.headers.get('X-Dispatcher', None)
            yield item
        yield None



We are now ready to run our spider. Save the file and start our spider with the following command:

$ scrapy crawl example -o report.csv



While the spider is running, you will see debug info sent to stdout, and any links resulting in a 404 or 500 will be captured in our report.

From here, we could add additional spiders to this project to check 301 redirects, warm the cache, or scrape data from our pages.