Monday, December 21, 2020

Hockey Experience Project - Part 1

Preseason Activities 

The family and I are big hockey fans. During a normal year we are either at PNC Arena watching the Carolina Hurricanes, watching them on TV, or discussing the upcoming season. So when COVID-19 hit the U.S. and Canada, the rest of the NHL season was cancelled. Through the summer we watched reruns of past games and kept tabs on when, or if, the playoffs would even happen.


We were excited when the NHL announced the playoffs were coming, but saddened to learn they would be played without fans. If you have never been to an NHL playoff game, GO! (Well, once we are allowed to go, that is.) Even if you don't care for hockey, you will have a great time. The phrase "There is nothing like playoff hockey" is absolutely true, and I wanted to bring that same experience to our home.

So how do we bring the playoff hockey experience to our living room?

We have a big-screen TV to watch the game on, and the family and I will definitely be cheering loudly, but what about the rest of the experience? When the Canes score there is no goal horn, no goal song, no flashing goal light, no rally towels to swing around. What about the intermissions, when we want to drink an overpriced beer and snack on our favorite foods from the concession stand?

OK, so in order to pull off this experience we are under a time crunch: two months to figure out quite a bit and see how close we can get to a legit stadium experience. Food and drinks can be sorted later if we get the rest of it going, so let's get to the first part.


Action! Lights! Goal Horn!

I've heard of a few Raspberry Pi projects where fellow hockey fans created programs to automatically pull scores and then display them or flash a light. So I knew the possibility was there, but how do we put it together and expand on it?

As luck would have it, I received a $100 Amazon gift card as a gift. So now I knew what I wanted to do was possible, I had the desire to do it, and I had the funds to make it happen. Let's get to work!

I've always wanted a Raspberry Pi to play around with, and now I have an excuse. Checking Amazon pricing, I could get a CanaKit Raspberry Pi 3B+ starter kit that has everything I'll need to get started for just under $100. Sure, I could have paid a bit more and gotten a Pi 4, but the 3B+ has plenty of processing power and features to do exactly what I want and more.

So while we await our Amazon delivery, on to the next problem: where do we get the scores so that we can make this a completely automated system?

SCORE!

Or should I say scores. After much googling around, I found there are quite a few 'real time' sports API sites, but they all want $$, and there is no guarantee the scores are truly live or that they even have NHL scores. Even when watching the game on TV there is a delay between when the action actually occurred and when it was broadcast. To add to that, we will be watching the games on YouTube TV, a streaming service, so it is a bit of an unknown how close to real time we will be seeing things.

With further googling I found some documentation on the NHL's APIs, which don't require signing up for a service or an API key to access! Score! You can check these out here.

Now for some test runs of the APIs to get an idea of what information we will get back. We will use Postman to make the API calls and verify the information. If you've never used Postman before, it's a great tool for testing APIs and is free for individual developers. While this is a manual step in the process, it will help us identify which APIs we want, what data we will get back, and the expected format, so that when we write the actual application we know which APIs to use.

A note on calling APIs 

    Some of these APIs return quite a bit of information; there are APIs for getting team info, schedules, player stats, team stats, game scores, etc. You always want to pick the APIs that return just the data you need so you save on bandwidth and processing. If you simply want to know the score of a particular game, you don't want to pull all game scores with expanded player stats; you would call the API for that game and keep it simple.

You also want to ensure you aren't calling the APIs every millisecond. It may seem like a good way to get the most up-to-date information, but more than likely your calls will be seen as a DoS attack and your IP will get banned fast.

Under Further Review

OK, let's make some calls and validate what we want. If you have taken a minute to look at the Teams APIs, you'll notice everything uses a teamId. This is an internal id that identifies each team's data. So let's find out what Carolina's teamId is.

NOTE: If you're not a Carolina fan, you can use the same calls; just substitute your team's id where we use Carolina's.

Our first call is to get all the NHL teams:

We make an HTTP GET call to https://statsapi.web.nhl.com/api/v1/teams to get all the teams. Scrolling through the list (or doing a find), we locate Carolina's info around line 390. There is a lot of info on the team here, but the part we are interested in is the line with 'id': 12. The number 12 is what we will use in future calls to get Carolina's data.

To test this we can add '?teamId=12' to the same API call as before to get just Carolina's team data.
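
If you'd rather make these calls from code than from Postman, here's a quick sketch of the same two requests using Python's requests library (the library choice and the fields I pull out are my assumptions; the endpoints are the ones tested above):

import requests

BASE = "https://statsapi.web.nhl.com/api/v1"

# Get every team, then find Carolina's entry by name
teams = requests.get(BASE + "/teams").json()["teams"]
canes = next(t for t in teams if t["name"] == "Carolina Hurricanes")
print(canes["id"])  # 12

# Once we know the id, ask for just that team's data
print(requests.get(BASE + "/teams", params={"teamId": 12}).json())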


Success! 

When we write our program we can just use '12' as the teamId variable, so we will make note of it; it shouldn't ever change, so there is no need to constantly look it up.


Is Aho Playing Tonight?

In order to get live scores on the day of the game, we will need to know the gameId ahead of time. The gameId is much like the teamId in that it is a unique identifier, so we can get data about just the game we are interested in. We can reasonably expect this gameId to be different for each game the Hurricanes play, so our program will need to first pull the schedule and get the gameId for the Carolina game.

If we look at the schedule API documentation, there are quite a few parameters we can use to get to the data we want. We will use the parameters teamId=12 and date=currentDate to see if there is a game and, if so, get its id. As a test, I know the published schedule shows a game on the 14th, so we will use that date to see what we get back.
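
Here's a rough sketch of what that schedule check could look like in Python (again assuming the requests library, with the test date hard-coded; the real program would plug in the current date):

import requests

BASE = "https://statsapi.web.nhl.com/api/v1"

# Is Carolina (teamId 12) playing on the test date?
params = {"teamId": 12, "date": "2020-12-14"}
schedule = requests.get(BASE + "/schedule", params=params).json()

if schedule["dates"]:
    game = schedule["dates"][0]["games"][0]
    print(game["gamePk"], game["link"])  # unique game id and the live feed link
else:
    print("No game on that date")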



If we scroll through the results, there is some info on the teams playing, the venue the event is held at, etc. The part we are interested in is the gamePk; that is the unique identifier for this game, and they also give us a nice link to the live feed.
 
 If our team wasn't playing on that day, we would get an empty array back for the dates field. So now we can poll the schedule once a day to find out if our team plays; if they do, we can get the start time from the schedule and, beginning at that time, start polling the live feed more frequently for scores.

Before we start calling the live feed, which can return over 30,000 lines of data and includes every play, we will want to take a look at what is available in the docs and choose the call that best fits our needs.

There are boxscore and linescore calls that return far less data, though still more than we need. If you want to use the live feed anyway, I would suggest taking a look at the diffPatch option. It allows you to give a date/time parameter, and the feed will return ALL changes since that date/time.
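
For example, a quick score check against the linescore call might look like the sketch below. The gamePk is a made-up placeholder, and the field names are the ones I saw in my test responses, so treat them as assumptions to verify against your own calls:

import requests

BASE = "https://statsapi.web.nhl.com/api/v1"

game_pk = 2020020123  # hypothetical gamePk; use the one returned by the schedule call
line = requests.get(BASE + "/game/{}/linescore".format(game_pk)).json()

home = line["teams"]["home"]
away = line["teams"]["away"]
print("{} {} - {} {}".format(away["team"]["name"], away["goals"],
                             home["team"]["name"], home["goals"]))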

Game Delay

Meh. After doing some testing, the APIs are only updated about once a minute when there is an update. Although hockey is usually a low-scoring game, it is very fast paced, and a three-goal lead can be lost in a minute of game play. Knowing my API data will be up to a minute off, and not knowing how delayed my streaming service will be at any given moment, I need a better solution so that everything stays in sync.

No one wants to watch their team score and then hear the goal horn three minutes later, or vice versa. To get the full experience you want to see the play, then hear the goal horn followed by the goal song. I could do a few trial runs and put in a delay to sync the TV with the app, but streaming services are at the mercy of your Internet connection and could randomly fall further behind, or buffer and catch up.


A Manual Intervention

The only way to keep everything in sync is to trigger the program manually. That way, when I see the goal I can execute my program and get the experience I want.

But what if I'm not in the room when they score? Also, I don't really want to lug my laptop around the house just to run this program. The program will be executed from my Pi, and I don't want to fumble with ssh sessions or try to educate my family on how to execute programs from a command line. If this is a manual task, it should be simple to do and execute immediately so that the overall hockey experience isn't lost.

I just remembered I have an old tablet sitting in the closet. We acquired it some time ago when every carrier had one of those "Add a new line and get a free Android tablet" deals. We used the tablet for a brief time, then only when traveling, and finally it just went to the closet.

Time For Some Web Work

OK, so I think we finally have the solution we will build in part 2 of this blog. I'll create a web application that runs on the Pi, and we can use the tablet to access it; since all we need is a basic browser, it doesn't matter which one we use or how old it is. At a minimum, our web app will display a button that, when clicked, plays our team's goal horn followed by the goal song.
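
To give an idea of where part 2 is headed, here is a bare-bones sketch of that web app. Flask is my assumed framework choice, and goal_horn() is just a placeholder for the playback logic we still have to build:

from flask import Flask

app = Flask(__name__)

def goal_horn():
    # Placeholder: play the horn and goal song, flash the light (see the sketch further below)
    print("GOAL!")

@app.route("/")
def index():
    # One big button the tablet's browser can tap
    return '<form action="/goal" method="post"><button style="font-size:3em">GOAL!</button></form>'

@app.route("/goal", methods=["POST"])
def goal():
    goal_horn()
    return index()

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)

Point the tablet's browser at the Pi's address and all anyone has to do is tap the big button.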

If you aren't familiar, every NHL team has its own goal horn and goal song. Here is the one for the Carolina Hurricanes that I'll be putting together.

NOTE: I'm not using YouTube or anything off of it; I'm only referencing it here to give you an idea of the sound I'll be playing.

Oh crap, the Raspberry Pi doesn't have a speaker.

Oh wait! Back to the closet of lost tech toys. I've got it: we have an old Altec Lansing Bluetooth speaker that no one uses anymore. It was one of those purchases we could take to the pool (yes, it's waterproof) to play music from our phones. It too got used for a while before it was sent to the closet.

OK, so we are still on track. When someone clicks a button on our web app from the tablet, the Pi will play the goal horn followed by the goal song and turn on a goal light.
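
Here is a rough sketch of what that button press might trigger on the Pi. Everything in it is an assumption to be sorted out in part 2: pygame for audio, placeholder horn.mp3 and goal_song.mp3 files, and an RPi.GPIO-driven pin for whatever light ends up wired in:

import time
import pygame
import RPi.GPIO as GPIO

LIGHT_PIN = 18  # assumed GPIO pin wired to the goal light (via a relay)

def celebrate():
    GPIO.setmode(GPIO.BCM)
    GPIO.setup(LIGHT_PIN, GPIO.OUT)
    GPIO.output(LIGHT_PIN, GPIO.HIGH)           # light on

    pygame.mixer.init()
    for clip in ("horn.mp3", "goal_song.mp3"):  # horn first, then the goal song
        pygame.mixer.music.load(clip)
        pygame.mixer.music.play()
        while pygame.mixer.music.get_busy():
            time.sleep(0.1)

    GPIO.output(LIGHT_PIN, GPIO.LOW)            # light off
    GPIO.cleanup()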

So, searching Amazon, I decided to get this light. It's cheap, simple, and will give me the effect I'm looking for.


Check out Part 2 of the Hockey Experience Project for how we will put it all together.

Thursday, September 10, 2020

Convert Confluence Documents to AEM/XML

Atlassian's Confluence is a great tool for managing and sharing project documentation with your team, and its seamless integration with Jira can help put everything in one place. We've used Confluence and Jira for almost all documentation, from general project docs to how-tos and even new-hire onboarding.

Recently I started rewriting our training material for DevOps engineers and found quite a lot of duplication in the documentation. For topics such as Git, Adobe Experience Manager, and others, there were documents created for front-end and back-end developers as well as project managers, QA engineers, etc.

As I discovered all this duplication and extra work I thought, there has to be a better way; we have too many people creating and maintaining the same thing over and over with only slight variations. My first thought was to move common documents to a shared location and assign owners. This would enable different groups to cross-reference material already documented for their respective teams. The dev team would own and be responsible for creating and maintaining Git and code-related topics, and the DevOps team would be responsible for topics on AEM setup, configuration, troubleshooting, etc.

While moving everything around in Confluence and assigning owners may help reduce the number of people updating documents, it doesn't solve the issue of ensuring the documentation fits the audience. Project managers probably care very little about how to do a git merge and squash commits, but they may need to know general info, such as what git is and how it can be used on a project (without getting too far into the technical weeds).

Through my searching for a better way, I stumbled across Dita (Darwin Information Typing Architecture). After reading up on it, I realized this was what I was looking for: a way to create documentation once, in a standardized way, and have it rendered to fit the needs of the audience.

Much to my dismay, Confluence doesn't have a Dita plugin or support it directly. This means I would either need to recreate all our documentation in the Dita format or find a way to easily convert it.

Having worked on numerous AEM projects as a Full-stack developer, DevOps Engineer, and AEM Architect, I remembered there is an XML Documentation feature for AEM that should do what I want. But first, I need to export my content from Confluence.


Exporting Confluence Documents

Confluence makes it very easy to export a document or an entire space in multiple formats: PDF, MS Word, HTML, etc.

In this example, we are going to export the entire space; this will give us all parent pages, child pages, assets, styles, etc.

To export a site in Confluence, go to: Space Settings -> Content Tools -> Export.



NOTE: While there is an export-to-XML option, it won't meet our needs. However, the export-to-HTML option is perfect for what we want, as the XML Documentation feature also provides workflows to convert our html documents into Dita topics.

Select 'HTML', and click 'Next'. 

Confluence will gather up all our documents and assets, convert the docs to html, and package everything in a zip file for us to download.


After downloading and unpacking our zip archive and examining the content, we can see each page is represented in html and contains references to other html pages in our space, but it also contains ids, attributes, and even some elements that are very Confluence-specific.


Since we will be using AEM to render the documents, we don't need a lot of the class names, ids, and other bits Confluence added for us. It is also important to note that our documents need to be in xhtml format before AEM will convert them to Dita.

If we uploaded this document as-is, we could expect nothing to happen; the document would not be processed by the workflow. If we simply added the xml header to identify it as an xhtml document, the workflow would attempt to process it but would fail with many errors. So we need a way to pre-process the documents to clean them up.



Cleaning Up With Tidy

If you are not familiar with HTML Tidy, it is a great command-line utility that can help clean up and correct most errors in html and xml documents. While we are not expecting any "bad" html, we will probably have some empty div elements and Confluence-specific items, and since we are processing hundreds of documents, we want to ensure they meet the xhtml standard and are as clean as possible without having to manually go through each one correcting errors.


Create a Tidy Config

A Tidy config will help ensure all documents are pre-processed the same way so that we have nice, uniform output. Using your favorite text editor, create a config.txt file to hold the configuration below.

clean: true
indent: auto
indent-spaces: 4
output-xhtml: true
add-xml-decl: true
add-xml-space: true
drop-empty-paras: true
drop-proprietary-attributes: true
bare: true
word-2000: true
new-blocklevel-tags: section
write-back: true
tidy-mark: false
merge-divs: true
merge-spans: true
enclose-text: true

To read more about what each of these settings does and other options available, check out the API doc page.

Most of the options above should be self-explanatory, but there are a few that need to be called out.

  • output-xhtml - Tells Tidy we want the output in xhtml, the format we need for AEM to process.
  • add-xml-decl - Adds the xml declaration to our output document
  • new-blocklevel-tags - Confluence adds a 'section' element to all of our pages; this element does not conform to xhtml, and Tidy will throw an error and refuse to process those docs unless we tell it that the element is acceptable. NOTE: This is a comma-separated list of elements, so if you have others, feel free to add them here. 
  • write-back - Write the results back to the original file. By default Tidy outputs to stdout; we could create a script to write new files and leave the originals alone, but since we still have all the originals in the zip file, we will overwrite the ones here.
  • tidy-mark - Tidy by default adds metadata to our document indicating that it processed the output. Since we want our output to be as clean as possible for our next step we don't want this extra info.
NOTE: I'm using the settings drop-empty-paras, merge-divs, and merge-spans to account for any occurrences where the original author unknowingly created extra elements, which is very common with wysiwyg (what you see is what you get) editors. Authors will sometimes hit the enter key a few times to create formatting, not realizing that behind the scenes they are adding extra empty <p> elements.


Processing with Tidy

After we have created our configuration file, we are ready to begin processing. We tell Tidy to use the configuration file we just created and to process all *.html files in the directory we unzipped our documents into.

$ tidy -config ~/projects/aem-xml/tidy/config.txt *.html

Depending on how many documents you have and their complexity, Tidy should complete its task in anywhere from a few seconds to a minute or two. If we reopen a document after Tidy has processed it, we should now see proper xhtml.


Our document has been reformatted, the xml declaration and namespace have been added, and any issues with our html have been resolved for us as well.


Scrolling to the bottom of the page, the html <section> tag(s) are still contained in the output, as well as other class names and ids.



You will also notice that the images contained in the attachments folder are referenced with <img> elements whose src attributes point into that attachments folder.


Once our documents are imported and processed in AEM, our images will need to be uploaded to the DAM, which will either change their path or add "/content/dam/" to the path. If we forget this step, good luck trying to reassociate the images with the original docs.

If we attempted to import our documents at this point, the workflows would process them but would not create proper Dita topics from them, and each document would require even more manual work.

The XML Documentation feature for AEM allows us to apply custom XSLT when processing our documents so that they end up as Dita topics and are recognized in AEM as such.


Applying Custom XSLT in AEM

For this next step, we will need access to an AEM Author instance with the XML Documentation feature installed.

After examining our documents we know there are a few tasks we need to perform to clean them up a bit further.

  • Remove empty elements
  • Remove all class names and ids
  • Update our image paths
First, we want to upload all the assets in the attachments folder to the DAM. We will put these in "/content/dam/attachments/...". Take a quick peek in the images directory that was exported to determine if there is anything we need, and upload as appropriate. If not, you may also need to update or remove those elements in the documents when we import them.

Open crxde, http://<host:port>/crx/de/index.jsp, and log in as an administrator. We will need to create an overlay so that we can specify our input and output folders for the workflow.

Copy the /libs/fmdita/config/h2d_io.xml file to /apps/fmdita/config/h2d_io.xml, and update the inputDir and outputDir elements to the path(s) you will be using.



You will notice there is already an html2dita subdirectory containing an h2d_extended.xsl file. When html documents are uploaded to our input folder, this file is included in the processing in addition to the default transforms.

Out of the box, the /apps/fmdita/config/html2dita/h2d_extended.xsl file contains just the xsl declaration and nothing else. We will add our transformations to this file so that everything uploaded is processed the same way.

We will create an identity template to do the majority of work for us. While this is very general and applies to all elements, you should definitely examine your own data first to determine how best to process it.
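
Here is a minimal sketch of what those templates could look like. The identity template and the attribute/empty-element matches below are generic XSLT rather than the exact transforms from this project, so adjust them to your own content:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Drop the Confluence class names, ids, and inline styles -->
  <xsl:template match="@class | @id | @style"/>

  <!-- Drop paragraphs and divs that are still empty after the Tidy pass -->
  <xsl:template match="*[(local-name()='p' or local-name()='div') and not(normalize-space()) and not(*)]"/>

  <!-- Identity template: copy everything else through unchanged -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>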



Images are a little more work to get right, but not overly complicated. We want to ensure we are only modifying internal images and not ones that may be linked from other sites; in other words, the src attribute should start with 'attachments'. Also take note that our XML editor expects the element tag for images to be <image href='<path_to_file>'/> rather than the xhtml <img src='<path_to_file>'/> element.
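
A template along these lines, added to the same h2d_extended.xsl sketch above and assuming the /content/dam/attachments/ location we chose earlier, could handle that rewrite:

  <!-- Rewrite internal images only: <img src="attachments/..."/> becomes
       <image href="/content/dam/attachments/..."/>; external images are untouched -->
  <xsl:template match="*[local-name()='img'][starts-with(@src, 'attachments')]">
    <image href="{concat('/content/dam/', @src)}"/>
  </xsl:template>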



Once we have our transforms in place, we are ready to upload our data. The XML Documentation feature comes with a few different workflows to process html documents once they are uploaded.

   



This allows us to either upload our pages individually or repackage them into a zip file and upload that to our input folder. Since we will be uploading a few hundred pages, the zip option is the one we want.

If we merely wanted to test our process, we could pick one or two html files and upload just those. Watching the logs and checking the output folder will give us an indication of whether everything is working correctly or whether additional transforms or cleanup are required.

After uploading, we can switch to the XML Editor in AEM and see the new *.dita files in the output folder we previously defined. Each file is named 1:1 after its original filename, so if we uploaded a file 123.html to our input folder, there should now be a 123.dita file in our output folder.



If our cleanup and transforms worked properly, we can now double-click on any of these new *.dita files and see the results of our hard work.



Conclusion

Using a few widely available tools, we can successfully migrate documents from Confluence into AEM using the XML Documentation feature. Of course, this is merely one of many steps in performing a true migration and fully using Dita to our benefit. Once the documents are in Dita format, a content author familiar with Dita should go through them looking for areas of reuse, identifying audiences, creating maps, etc.

If you are serious about working with Dita, you should consider using a compliant editor such as Adobe FrameMaker. FrameMaker can be integrated with AEM to give your team a better experience creating Dita documents, collaborating, and publishing them in Adobe Experience Manager.

Tuesday, September 1, 2020

Checking Links with Scrapy

Whether you are supporting an AEM project or any other web project, one of the regular tasks someone on the team should perform is checking all the links on the site before go-live so the authoring team can correct any broken ones.

If we only have a few pages, this can be a pretty easy task for someone to perform manually, but for large sites with hundreds or thousands of pages it can be a very tall task even for the whole team. One common strategy is to take all known URLs and/or redirects, put them in a file, and write a quick curl script to iterate over them; but what about unknown links that may have been created, or links returning error codes?

Luckily, we can use the open-source Scrapy framework to create a custom crawler that checks all the links for us and outputs a report we can then give to the authoring team. Scrapy is a Python framework that provides both command-line utilities with prebuilt spiders and the ability to customize those spiders to our specific needs. While the framework was originally written to scrape data from websites, we can use it for our purposes as well.


Before we get started, you will need a system with Python 3 and pip installed to use as our development environment.

Install Scrapy using pip

$ pip install scrapy


Normally at this step you would create a folder for all your project files, which you can certainly do. For this project, we will use the scrapy command-line utility to create our initial project structure for us.

$ scrapy startproject linkCrawler

With the command above, scrapy will create the initial project 'linkCrawler' and all the files necessary to get started. If we take a look at the structure, it should look like the following:

$ tree .
.
├── linkCrawler
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

2 directories, 7 files




Next, we need to create our first spider. Again, we will use the scrapy utility to generate the spider for us, and we will customize it to our needs later.

$ cd linkCrawler
$ scrapy genspider -t crawl example example.com
Created spider 'example' using template 'crawl' in module:
  linkCrawler.spiders.example


Let's break down the command above:

scrapy genspider - tells scrapy to generate a spider for us
-t crawl - tells scrapy to use the 'crawl' template when creating the spider
example - our spider's name.
example.com - The domain our spider will crawl


Let's take a look at our project structure. We can see scrapy added our spider example.py.

$ tree .
.
├── linkCrawler
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── example.py
└── scrapy.cfg

2 directories, 11 files


The two files we will concentrate on are the 'settings.py' and 'example.py' files.

Since we are crawling our own site and we want to check ALL the links on it, we want to tell our spider to disregard the rules in the robots.txt file. Normally we would want to obey the rules in robots.txt, especially if we do not own the site.

We can turn this feature off by opening the settings.py file, finding the following line, and changing it to 'False'.

# Obey robots.txt rules
ROBOTSTXT_OBEY = False


Save the file. 

Next we will modify our spider to check for bad links and create a report.

Opening the example.py file, we can see the basic structure has already been created for us.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        return item



First, let's add an Item subclass to hold our report data (this uses Item and Field from scrapy.item, which we will import in the complete spider below).

class BadLinks(Item):
    referer = Field()
    url = Field()
    status = Field()
    dispatcher = Field()



So that our report is useful, we want to capture a few data items:
  • Referer: What page were we on when the link was followed
  • URL: What is the link that was followed
  • Status: The HTTP status code returned when we crawled the link
  • Dispatcher: The value of the 'X-Dispatcher' header. This will tell us what dispatcher/publish pair had an issue so that we can investigate if there is a problem with that pair.

In our ExampleSpider class, we will add support for handling HTTP status codes outside the normal 200-300 range.

class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']
    handle_httpstatus_list = [404, 410, 301, 500]


Here is an explanation of what is going on in these few lines:
  • name: This is the name of this spider and must be unique.
  • allowed_domains: List of the domains we want the spider to crawl. We could include any subdomains or other domains linked to this site that we own and want to check.
  • start_urls: List of urls where we want crawling to begin. Since we want the entire site, we will leave this at the root of the site.
  • handle_httpstatus_list: List of http status codes outside of the 200-300 range that we want this spider to handle.
Let's add a few rules telling our spider what it should crawl.

rules = [
    Rule(
        LinkExtractor(allow_domains=allowed_domains, deny=('/media/*'), unique=True),
        callback='parse_item',
        follow=True
    ),
    Rule(
        LinkExtractor(allow=(''), unique=True),
        callback='parse_item',
        follow=False
    )
]




In the above section of code we have two Rule objects using the LinkExtractor object. 

In the first rule, we specify allow_domains to match our allowed_domains variable, add a deny pattern for the /media/ section of the site, and specify that we only wish to capture unique links.

The second Rule object allows all links to be identified but not followed. This covers external links in the output without crawling those sites.

In the callback we call 'parse_item' to decide how the link is to be handled. So let's modify that method to record the data for bad links.

    def parse_item(self, response):
        report_if = [404, 500]
        if response.status in report_if:
            item = BadLinks()
            item['referer'] = response.request.headers.get('Referer', None)
            item['status'] = response.status
            item['url'] = response.url
            item['dispatcher'] = response.headers.get('X-Dispatcher', None)
            yield item
        yield None



In the section above, we check whether the response status code is a 404 or 500; if it is, we capture the values we want for our report.

Our complete spider should now look like the following:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field


class BadLinks(Item):
    referer = Field()
    url = Field()
    status = Field()
    dispatcher = Field()


class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']
    handle_httpstatus_list = [404, 410, 301, 500]

    rules = [
        Rule(
            LinkExtractor(allow_domains=allowed_domains, deny=('/media/*'), unique=True),
            callback='parse_item',
            follow=True
        ),
        Rule(
            LinkExtractor(allow=(''), unique=True),
            callback='parse_item',
            follow=False
        )
    ]

    def parse_item(self, response):
        report_if = [404, 500]
        if response.status in report_if:
            item = BadLinks()
            item['referer'] = response.request.headers.get('Referer', None)
            item['status'] = response.status
            item['url'] = response.url
            item['dispatcher'] = response.headers.get('X-Dispatcher', None)
            yield item
        yield None



We are now ready to run our spider. Save the file and start our spider with the following command:

$ scrapy crawl example -o report.csv



While the spider is running, you will see debug info sent to stdout, and any links resulting in a 404 or 500 will be captured in our report.

From here we could add additional spiders to this project to handle checking 301 redirects, warming the cache, or scraping data from our pages.
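
As one hypothetical example of such a follow-up (the spider name, settings, and start URL below are my own illustrative choices, not part of the crawler we just built), a second spider in the same project could take a list of legacy URLs and report where each one redirects:

import scrapy


class RedirectCheckSpider(scrapy.Spider):
    name = 'redirects'
    # Disable automatic redirect following so we can inspect the 301/302 response itself
    custom_settings = {'REDIRECT_ENABLED': False}
    handle_httpstatus_list = [301, 302, 404, 500]

    # In practice these could be read from a file of known legacy URLs
    start_urls = ['http://example.com/old-page']

    def parse(self, response):
        yield {
            'url': response.url,
            'status': response.status,
            'location': response.headers.get('Location', None),
        }

Running it with 'scrapy crawl redirects -o redirects.csv' would produce a CSV report just like the one above.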

Saturday, June 13, 2020

Automating Sandbox Environments - Part I

This series is based on an internal project I started last year. Working on multiple projects at the same time means my time is very valuable, so I'm always looking to improve, automate tasks, and empower our team.


The Problem

Periodically I get asked to provision an AEM environment to showcase our work to existing and potential partners and clients. Of course, with any environment you provision, those who have access to it don't want you to remove it, even when the project is over, just in case they need it for 'something'.

Once word about this environment gets out, other teams, such as Analytics, UX/UI, and Marketing, may want to use it as well. What ends up happening is that no one is really 'managing' the application; you have a single set of servers trying to fulfill different needs for different teams.

Over time you find the original owner of the environment is no longer with the company, the code and content have become stale, and the application may be starting to throw errors.


Acceptance Criteria

Analyzing the problem above, we find there are actually several problems we need to solve for.

  1. Each team should have their own environment that is specific to their use-case.
    Use-Case: If the Frontend dev team wants to showcase a SPA, that work should not conflict with anything the Data Engineering team wants to show for integrating Data Layers.

  2. Each team should be able to have more than one environment.
    Use-case: The team may be showcasing a particular feature or implementation and want to give the customer limited access to test the feature.

  3. Environments should be repeatable and have a short shelf-life.
    Use-case: We don't want to have someone playing 'cleanup' at the end of every demo in order to get it prepared for the next demo. In addition, if the servers aren't being used we don't want to pay for their up-time.

  4. Provisioning an environment should be quick.
    Use-case: Sometimes we will have a request for an environment months before it is needed, other times we may be alerted hours prior to a meeting with a potential partner that they wish to see certain features.

  5. Our teams should be empowered.
    Use-case: While we want to be able to place strict controls over what is provisioned and how it is provisioned, our teams should be empowered to 'self-serve' a bit and dictate what that environment is used for.

  6. We should be able to provision different versions of the application.
    Use-case: While most demos will always be with the latest and greatest, sometimes a client wants to see things on the same version they have. In addition, we don't want to spend a lot of time performing upgrades.

  7. Employees should be able to log in to the application right away with their corporate IDs.
    Use-case: We don't want people sharing an account, which is a security issue, and we don't want someone to have to manually create and manage user accounts and permissions in the application.

  8. We don't want to be locked in to a platform or provider.
    Use-case: With very little work, we should be able to adapt our solution to be applied to AWS, Azure, VMWare, or any other provider.

The Solution

You may already be thinking of different possibilities, such as Dockerfiles, AMIs, CloudFormation, Ansible, etc. Alone, each possible solution has its pros and cons and may not meet all of our criteria.


For our solution we will be using several technologies together. Initially we will start with provisioning to Amazon Web Services. Later we may add support to provision to Azure as well.

We will use Packer to create images, Ansible to customize the images, Terraform to provision the resources on AWS, and Make to help tie it all together and make the commands more friendly. Lastly, we will use Docker to containerize our control machine so that we don't need a dedicated resource.

NOTE: This series does not aim to teach these technologies; you should already be familiar with them.

Requirements:

  • An AWS account with an IAM user that has privileges to provision EC2 instances, create AMIs and other resources.
  • GitLab or GitHub account for our project, but also access to other application project repositories that we will be deploying.
  • Artifactory or other binary repository to stage content packages, but also to maintain the different application versions
  • A VM or machine that will run our automated tools. This can be a dedicated VM; in our example, we will create a Docker container for this purpose.

    The machine/VM should have Packer, Ansible, Terraform, Make, and Git installed.

In the next part of our series, we will set up our VM workspace, create our project structure, and dive right in. Each part of this series will cover implementing a different piece of the puzzle.

While we are working on these pieces, we can go ahead and get our teams thinking about what content and configurations they may want installed by default.