Tuesday, September 1, 2020

Checking Links with Scrapy

Whether you are supporting an AEM project or any other web project, one of the regular tasks someone on the team should perform is to check all of the links on the site before go-live so the authoring team can correct any that are broken.

If the site only has a few pages this can be a pretty easy task to perform manually, but for large sites with hundreds or thousands of pages it can be a very tall order even for the whole team. One common strategy is to take all known URLs and/or redirects, put them in a file, and write a quick curl script to iterate over them; but what about unknown links that may have been created, or links returning error codes?
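
For reference, that known-URL approach might look something like the following minimal sketch, written here in Python rather than curl and assuming a hypothetical urls.txt file with one URL per line. It only ever covers the URLs we already know about, which is exactly its weakness:

import urllib.request
import urllib.error

# urls.txt is a hypothetical file containing one known URL per line
with open('urls.txt') as f:
    for url in (line.strip() for line in f if line.strip()):
        try:
            # urlopen follows redirects, so this only surfaces hard errors
            status = urllib.request.urlopen(url).status
        except urllib.error.HTTPError as e:
            status = e.code
        print(status, url)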

Luckily, we can use the open-source Scrapy framework to create a custom crawler that checks all of the links for us and outputs a report we can hand to the authoring team. Scrapy is a Python framework that provides both command-line utilities with prebuilt spiders and the ability to customize those spiders to our specific needs. While the framework was originally written to scrape data from websites, it works just as well for our purposes.


Before we get started you will need a system with Python 3 and pip installed to use as our development environment.

Install Scrapy using pip

$ pip install scrapy


Normally at this step you would create a folder for your project files, which you can certainly do. For this project, however, we will use the scrapy command-line utility to create the initial project structure for us.

$ scrapy startproject linkCrawler

With the command above, scrapy creates the initial project 'linkCrawler' along with all of the files necessary to get started. If we take a look at the structure it should look like the following:

$ tree .
.
├── linkCrawler
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

2 directories, 7 files




Next, we need to create our first spider. Again, we will use the scrapy utility to generate our spider for us and we will customize it to our needs later.

$ cd linkCrawler
$ scrapy genspider -t crawl example example.com

Created spider 'example' using template 'crawl' in module:
  linkCrawler.spiders.example


Let's break down the command above:

  • scrapy genspider - tells scrapy to generate a spider for us.
  • -t crawl - tells scrapy to use the 'crawl' template when creating the spider.
  • example - our spider's name.
  • example.com - the domain our spider will crawl.


Let's take a look at our project structure again. We can see that scrapy has added our spider, example.py.

$ tree .
.
├── linkCrawler
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── example.py
└── scrapy.cfg

2 directories, 8 files


The two files we will concentrate on are 'settings.py' and 'example.py'.

Since we are crawling our own site and we want to check ALL of the links on it, we want to tell our spider to disregard the rules in the robots.txt file. Normally we would want to obey the rules in robots.txt, especially if we do not own the site.

We can turn this feature off by opening the settings.py file, finding the following line, and changing the value to 'False'.

# Obey robots.txt rules
ROBOTSTXT_OBEY = False


Save the file. 

Next we will modify our spider to check for bad links and create a report.

Opening the example.py file, we can see the basic structure has already been created for us.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        return item



First, let's add an Item class to hold our report data.

class BadLinks(Item):
    referer = Field()
    url = Field()
    status = Field()
    dispatcher = Field()



So that our report is useful, we want to capture a few data items (illustrated in the short sketch after this list):
  • Referer: What page were we on when the link was followed
  • URL: What is the link that was followed
  • Status: The HTTP status code returned when we crawled the link
  • Dispatcher: The value of the 'X-Dispatcher' header. This will tell us what dispatcher/publish pair had an issue so that we can investigate if there is a problem with that pair.
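
As a quick illustration of how these fields fit together, here is a minimal sketch of populating a BadLinks item by hand; the values below are made up purely for illustration. Scrapy Items behave much like dictionaries, which is what lets the CSV exporter turn each one into a row later on.

from scrapy.item import Item, Field

class BadLinks(Item):
    referer = Field()
    url = Field()
    status = Field()
    dispatcher = Field()

# Hypothetical values, just to show how an item is filled in
item = BadLinks()
item['referer'] = 'http://example.com/en/home.html'
item['url'] = 'http://example.com/en/missing-page.html'
item['status'] = 404
item['dispatcher'] = 'dispatcher1'
print(dict(item))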

In our ExampleSpider class, we will add support for handling HTTP status codes other than 200.

class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']
    handle_httpstatus_list = [404, 410, 301, 500]


Here is an explanation of what is going on in these few lines:
  • name: This is the name of this spider and must be unique.
  • allowed_domains: List of the domains we want the spider to crawl. We could include any subdomains or other domains linked to this site that we own and want to check (see the snippet after this list).
  • start_urls: List of URLs where we want crawling to begin. Since we want the entire site we will leave this at the root of the site.
  • handle_httpstatus_list: List of HTTP status codes outside of the 200-300 range that we want this spider to handle.
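
For example, if the site is also reachable on a www subdomain, or links to another domain we own, we might broaden the configuration along these lines (the extra domains here are hypothetical, purely to show the shape of the change):

class ExampleSpider(CrawlSpider):
    name = 'example'
    # Hypothetical additional domains we own and also want checked
    allowed_domains = ['example.com', 'www.example.com', 'assets.example.org']
    start_urls = ['http://example.com/', 'http://www.example.com/']
    handle_httpstatus_list = [404, 410, 301, 500]
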
Let's add a few rules telling our spider what it should crawl.

rules = [
    Rule(
        LinkExtractor(allow_domains=allowed_domains, deny=(r'/media/',), unique=True),
        callback='parse_item',
        follow=True
    ),
    Rule(
        LinkExtractor(allow=(), unique=True),
        callback='parse_item',
        follow=False
    )
]




In the above section of code we have two Rule objects, each using a LinkExtractor.

In the first rule we restrict allow_domains to our allowed_domains variable, add a deny pattern for the /media/ section of the site, and ask for unique links only.

The second Rule object allows all links to be identified but not followed. This covers external links in the output without crawling those sites.

In both rules the callback calls 'parse_item' to decide how each link is handled, so let's modify that method to record the data for bad links.

def parse_item(self, response):
    report_if = [404, 500]
    if response.status in report_if:
        item = BadLinks()
        item['referer'] = response.request.headers.get('Referer', None)
        item['status'] = response.status
        item['url'] = response.url
        item['dispatcher'] = response.headers.get('X-Dispatcher', None)
        yield item



In the section above we check whether the response status code is a 404 or 500; if it is, we capture the values we want for our report.

Our complete spider should now look like the following:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field


class BadLinks(Item):
    referer = Field()
    url = Field()
    status = Field()
    dispatcher = Field()


class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']
    handle_httpstatus_list = [404, 410, 301, 500]

    rules = [
        Rule(
            LinkExtractor(allow_domains=allowed_domains, deny=(r'/media/',), unique=True),
            callback='parse_item',
            follow=True
        ),
        Rule(
            LinkExtractor(allow=(), unique=True),
            callback='parse_item',
            follow=False
        )
    ]

    def parse_item(self, response):
        report_if = [404, 500]
        if response.status in report_if:
            item = BadLinks()
            item['referer'] = response.request.headers.get('Referer', None)
            item['status'] = response.status
            item['url'] = response.url
            item['dispatcher'] = response.headers.get('X-Dispatcher', None)
            yield item
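
As an optional variation (not included in the listing above), the same method could also flag permanent redirects: 301 is already in handle_httpstatus_list, so adding it to report_if and reading the Location header would record both the old and new address. This is only a sketch of one way to do it:

def parse_item(self, response):
    # Variation: also report permanent redirects alongside broken links
    report_if = [301, 404, 500]
    if response.status in report_if:
        item = BadLinks()
        item['referer'] = response.request.headers.get('Referer', None)
        item['status'] = response.status
        # For a 301 the Location header holds the new address; append it
        # to the old URL so both end up in the report
        location = response.headers.get('Location', b'').decode('utf-8')
        item['url'] = response.url + ((' -> ' + location) if location else '')
        item['dispatcher'] = response.headers.get('X-Dispatcher', None)
        yield item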



We are now ready to run the spider. Save the file and start the crawl with the following command:

$ scrapy crawl example -o report.csv



While the spider is running you will see debug info sent to stdout, and any links resulting in a 404 or 500 will be captured in our report.

From here we could add additional spiders to this project to handle checking 301 redirects, warming the cache, or scraping data from our pages.
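
As an example of what one of those additional spiders might look like, here is a minimal sketch of a redirect checker. Everything in it, including the redirects.csv file name and its 'source' and 'target' columns, is a hypothetical illustration rather than part of the project above: it reads known redirect pairs and reports any source URL that does not return a 301 pointing at the expected target.

import csv
import scrapy


class RedirectCheckSpider(scrapy.Spider):
    name = 'redirectcheck'
    # Keep 301/302 responses out of the redirect middleware so we can inspect them
    handle_httpstatus_list = [301, 302, 404]

    def start_requests(self):
        # redirects.csv is a hypothetical file with 'source' and 'target' columns
        with open('redirects.csv') as f:
            for row in csv.DictReader(f):
                yield scrapy.Request(
                    row['source'],
                    callback=self.check_redirect,
                    cb_kwargs={'expected': row['target']},
                    dont_filter=True
                )

    def check_redirect(self, response, expected):
        location = response.headers.get('Location', b'').decode('utf-8')
        # Report anything that is not a 301 landing exactly on the expected target
        if response.status != 301 or location != expected:
            yield {
                'url': response.url,
                'status': response.status,
                'location': location,
                'expected': expected,
            }

Running it with 'scrapy crawl redirectcheck -o redirect-report.csv' would produce a CSV report in the same way as the link checker above.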
