Whether you are supporting an AEM project or any other web project, one of the regular tasks someone on the team should perform before go-live is checking all the links on the site so the authoring team can correct any broken ones.
If the site only has a few pages this is an easy task to perform manually, but for large sites with hundreds or thousands of pages it can be a very tall task even for the whole team. One common strategy is to take all known URLs and/or redirects, put them in a file, and write a quick curl script to iterate over them; but what about unknown links that may have been created, or links that return error codes?
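For reference, that known-URL check can be as simple as a one-line loop. This is only a rough sketch, where urls.txt is a hypothetical file containing one URL per line:
$ while read -r url; do curl -s -o /dev/null -w "%{http_code} %{url_effective}\n" "$url"; done < urls.txt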
Luckily, we can use the open-source Scrapy framework to create a custom crawler that checks all the links for us and outputs a report we can hand to the authoring team. Scrapy is a Python framework that provides both command line utilities with prebuilt spiders and the ability to customize those spiders to our specific needs. While the framework was originally written to scrape data from websites, it works just as well for our purposes.
Before we get started you will need a system with Python 3 and pip installed to use as our development environment.
Install Scrapy using pip
$ pip install scrapy
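If the install worked, the scrapy command should now be on your path; you can confirm by asking for its version:
$ scrapy version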
Normally at this step you would create a folder for all of the project files, which you can certainly do. For this project we will use the scrapy command line utility to create our initial project structure for us.
$ scrapy startproject linkCrawler
With the command above, scrapy creates the initial project 'linkCrawler' and all the files necessary to get started. If we take a look at the structure it should look like the following:
$ tree .
.
├── linkCrawler
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg
2 directories, 7 files
Next, we need to create our first spider. Again, we will use the scrapy utility to generate our spider for us and we will customize it to our needs later.
$ cd linkCrawler
$ scrapy genspider -t crawl example example.com
Created spider 'example' using template 'crawl' in module:
linkCrawler.spiders.example
Let's break down the command above:
- genspider: the scrapy subcommand that generates a new spider from a template
- -t crawl: use the 'crawl' template, which produces a CrawlSpider that follows links based on a set of rules
- example: the name of the new spider
- example.com: the domain the spider will crawl, used to pre-populate allowed_domains and start_urls
After running it, the project structure now includes the new spider:
$ tree .
.
├── linkCrawler
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── example.py
└── scrapy.cfg
2 directories, 8 files
Before we customize the spider, update linkCrawler/settings.py and set ROBOTSTXT_OBEY to False, since we want the crawler to check every link rather than skip anything robots.txt disallows:
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
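Depending on the size of the site, you may also want to slow the crawler down so it does not hammer the dispatcher/publish tier. The settings.py additions below are optional and only a sketch; the values are illustrative and should be tuned for your environment:
# Be polite to the publish tier; tune these values for your environment
DOWNLOAD_DELAY = 0.5
CONCURRENT_REQUESTS_PER_DOMAIN = 8
AUTOTHROTTLE_ENABLED = True
# Identify the crawler in the dispatcher/publish access logs
USER_AGENT = 'linkCrawler (link checker)'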
The generated linkCrawler/spiders/example.py starts out as a stock CrawlSpider template:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        return item
Now we can start customizing. First, define an Item class to hold the data for each bad link (it needs Item and Field from scrapy.item, shown in the full listing below):
class BadLinks(Item):
    referer = Field()
    url = Field()
    status = Field()
    dispatcher = Field()
So that our report is useful we want to capture a few data items:
- Referer: What page were we on when the link was followed
- URL: What is the link that was followed
- Status: The HTTP status code returned when we crawled the link
- Dispatcher: The value of the 'X-Dispatcher' header. This will tell us what dispatcher/publish pair had an issue so that we can investigate if there is a problem with that pair.
Next, update the spider's class attributes:
class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']
    handle_httpstatus_list = [404, 410, 301, 500]
- name: The name of this spider; it must be unique within the project.
- allowed_domains: List of the domains we want the spider to crawl. We could include any subdomains or other domains linked from this site that we own and want to check.
- start_urls: List of URLs where crawling begins. Since we want the entire site, we leave this at the root of the site.
- handle_httpstatus_list: List of HTTP status codes outside of the 200-300 range that we want this spider's callbacks to receive instead of having them filtered out.
With the attributes in place, we replace the generated rules:
    rules = [
        # Follow links within our own domains (skipping /media/) and check each page
        Rule(
            LinkExtractor(allow_domains=allowed_domains, deny=r'/media/*', unique=True),
            callback='parse_item',
            follow=True
        ),
        # Check every other link we find (external links included) but do not follow it
        Rule(
            LinkExtractor(unique=True),
            callback='parse_item',
            follow=False
        )
    ]
In the section above we define two Rule objects, each using a LinkExtractor. The first rule extracts links within our allowed domains (skipping anything under /media/), sends each response to parse_item, and keeps crawling from those pages. The second rule matches every remaining link, external ones included, and also checks it with parse_item, but does not crawl any further from it. Finally, we implement the parse_item callback that builds our report items:
    def parse_item(self, response):
        report_if = [404, 500]
        if response.status in report_if:
            item = BadLinks()
            item['referer'] = response.request.headers.get('Referer', None)
            item['status'] = response.status
            item['url'] = response.url
            item['dispatcher'] = response.headers.get('X-Dispatcher', None)
            yield item
In the section above we check whether the response status code is a 404 or 500; if it is, we populate a BadLinks item with the values we want for our report. Putting it all together, linkCrawler/spiders/example.py now looks like this:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field


class BadLinks(Item):
    referer = Field()
    url = Field()
    status = Field()
    dispatcher = Field()


class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']
    handle_httpstatus_list = [404, 410, 301, 500]

    rules = [
        # Follow links within our own domains (skipping /media/) and check each page
        Rule(
            LinkExtractor(allow_domains=allowed_domains, deny=r'/media/*', unique=True),
            callback='parse_item',
            follow=True
        ),
        # Check every other link we find but do not follow it
        Rule(
            LinkExtractor(unique=True),
            callback='parse_item',
            follow=False
        )
    ]

    def parse_item(self, response):
        report_if = [404, 500]
        if response.status in report_if:
            item = BadLinks()
            item['referer'] = response.request.headers.get('Referer', None)
            item['status'] = response.status
            item['url'] = response.url
            item['dispatcher'] = response.headers.get('X-Dispatcher', None)
            yield item
With the spider complete, we can run the crawl and write the results to a CSV report:
$ scrapy crawl example -o report.csv
While the spider is running you will see debug info sent to stdout and any links resulting in a 404 or 500 will be captured in our report.
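If CSV is not convenient, Scrapy's feed exports can write other formats based on the output file extension; for example, the same crawl can produce a JSON report instead:
$ scrapy crawl example -o report.json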
From here we could add additional spiders to this project to check 301 redirects, warm a cache, or scrape data from our pages; a rough sketch of a redirect checker follows.
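The sketch below is only one possible starting point, not a drop-in solution: it assumes a hypothetical redirects.txt file with one 'source,expected-target' pair per line and reports any source that does not redirect where we expect.
import scrapy


class RedirectCheckSpider(scrapy.Spider):
    name = 'redirectcheck'
    # Let 301/302 (and missing pages) reach our callback instead of being filtered out
    handle_httpstatus_list = [301, 302, 404]
    # Disable the redirect middleware so we can inspect the redirect response itself
    custom_settings = {'REDIRECT_ENABLED': False}

    def start_requests(self):
        # redirects.txt is a hypothetical input: one "source,expected-target" per line
        with open('redirects.txt') as f:
            for line in f:
                source, expected = line.strip().split(',')
                yield scrapy.Request(source, callback=self.check_redirect,
                                     meta={'expected': expected})

    def check_redirect(self, response):
        expected = response.meta['expected']
        location = response.headers.get('Location', b'').decode()
        # Report anything that is not a redirect to the expected target
        if response.status not in (301, 302) or location != expected:
            yield {
                'url': response.url,
                'status': response.status,
                'location': location,
                'expected': expected,
            }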