Gideon Kimutai, a freelance Django web developer

September 17, 2017


Scraping The Python Nairobi Blog

Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.

So let's get started.


  1. You need to have Python installed on your computer. You can download it from the official Python site.
  2. Create a virtual environment; the steps can be found here.
  3. Activate your virtual environment and install the Scrapy framework by running the command pip install scrapy.

NB: if everything installed correctly, you are ready to start scraping some data.

In this tutorial we will be scraping the Python Nairobi Blog, but you can apply the same concepts to scrape other websites too. You will also have to read further on your own, since I will only cover the basics.

Check again that Scrapy is available in your current environment by typing scrapy in your terminal. If you see a list of help options, you are good to go.
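The same check can be done from Python. This is a minimal sketch that uses only the standard library; the helper name scrapy_available is made up for this example.

```python
import importlib.util


def scrapy_available():
    """Return True if the scrapy package can be imported in this environment."""
    return importlib.util.find_spec("scrapy") is not None


if __name__ == "__main__":
    if scrapy_available():
        print("scrapy is installed in this environment")
    else:
        print("scrapy is missing, run: pip install scrapy")
```

Run it with the virtual environment activated, otherwise it checks the system-wide Python instead.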

We are now going to create our project. To do that, type

$ scrapy startproject naiblog

After creating the project, you should see a directory named naiblog in the folder where you ran the command.

The directory structure of naiblog should look like this:

    naiblog/
        scrapy.cfg            # deploy configuration file
        naiblog/              # project's Python module, you'll import your code from here
            __init__.py
            items.py          # project items definition file
            pipelines.py      # project pipelines file
            settings.py       # project settings file
            spiders/          # a directory where you'll later put your spiders
                __init__.py

Create a new Python file inside the folder naiblog/spiders/

Now copy the contents below into the new file.

import scrapy


class NaiblogSpider(scrapy.Spider):
    name = 'naiblog'

    start_urls = [
        # the URL of the Python Nairobi Blog goes here
    ]

    def parse(self, response):
        # filled in later in this tutorial
        pass

There are a few things to note here.

  • Our spider subclasses scrapy.Spider.
  • name = 'naiblog' is the name of our spider; it must be unique within this project.
  • start_urls is a class attribute holding the list of URLs the spider will start crawling from. This is just one of the ways to supply them.
  • parse() is the method that will be called to handle the response downloaded for each request.

Now that you have at least the basics, change the parse() method to match the one below.

class NaiblogSpider(scrapy.Spider):
    # name and start_urls stay the same as before

    def parse(self, response):
        # text of the <h1 class="content-subhead"> heading
        sub_header = response.xpath("//div//h1[@class='content-subhead']/text()").extract_first()
        print(sub_header)

In your terminal, navigate to the root directory of the naiblog project and type

$ scrapy crawl naiblog --nolog

The spider should print the subheading:

Latest posts

Yeey, you just extracted the content subheading of the blog.

Open the blog in your browser, right-click on the heading Latest posts, then select the option Inspect. You will see that the post list is wrapped inside

<div class="posts">
    <h1 class="content-subhead">
        Latest posts
    </h1>
    ...
</div>
NB: when extracting content from a response, we use CSS or XPath selectors to select elements from the downloaded page.

I find XPath more powerful than CSS. You will have to read more on your own to understand XPath expressions.

Now to the final part of our spider. The comments in the code will help you understand what is happening.

import scrapy

class NaiblogSpider(scrapy.Spider):
    name = 'naiblog'

    start_urls = [
        # the URL of the Python Nairobi Blog goes here
    ]

    def parse(self, response):
        # let's start by extracting the root node /html
        # from the response.
        html = response.xpath("/html")

        # select the body element which is inside
        # the root node html
        body = html.xpath("//body")

        # let's now get the content element
        content = body.xpath("//div[@class='content']")

        # extract the posts element
        posts = content.xpath("//div[@class='posts']")

        # iterate over the posts element in order to get
        # each individual post by extracting section node

        for post in posts.xpath("section[@class='post']"):

            # get the node header
            header = post.xpath("header[@class='post-header']")
            # finally get the title text
            title = header.xpath("h3 /a/text()").extract_first()

            # to get the description is the hardest part, if you inspect the element
            # of the page you will notice that there are two <p class='post-meta'> and there
            # is a <p> element without any attribute between them which holds a simple description
            # of the post. Basically it makes it hard to extract the description of the post.
            # I suggest the developers of the web app should look into it.
            # with that issue in mind I did some tweak below in order to get the empty <p>
            # tag.
            description = header.xpath("p[@class='post-meta']|p/text()").extract()[1]

            # extract post category
            category = header.xpath("p[@class='post-meta'] /a/text()").extract_first()

            # extract post date
            date = header.xpath("p[@class='post-meta'] /text()")[-1].extract()

            # finally let's return some data
            yield {
                'description': description,
                'category': category,
                'title': title,
                'date': date,
            }
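
One caveat with the spider above: indexing with extract()[1] or [-1] raises an IndexError when fewer nodes match than expected. Here is a small sketch of a defensive helper. The name pick and the sample list are made up for this example and are not part of Scrapy.

```python
def pick(values, index, default=None):
    """Return values[index] stripped of whitespace, or default if it is missing."""
    try:
        return values[index].strip()
    except IndexError:
        return default


# Stand-in for the list of strings that .extract() returns.
extracted = ["  first text node  ", "  A short description  "]

print(pick(extracted, 1))              # second item, surrounding whitespace removed
print(pick(extracted, 5, default=""))  # index out of range, falls back to ""
```

Inside parse() it could wrap the existing calls, for example pick(header.xpath("p[@class='post-meta']|p/text()").extract(), 1).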

Let's execute our spider:

$ scrapy crawl naiblog -o posts.json

You will see lots of logs displayed in your terminal. When no more logs appear, go to the root of the naiblog project and you will find a file named posts.json. Open it and you will see all the posts from the pynbo blog.
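Once posts.json exists, you can load it back into Python. This is a minimal sketch, assuming the file holds the list of dictionaries yielded by the spider; the function name count_by_category is made up for this example.

```python
import json
from collections import Counter


def count_by_category(path):
    """Load scraped posts from a JSON file and count them per category."""
    with open(path, encoding="utf-8") as handle:
        posts = json.load(handle)  # a list of dicts, one per yielded item
    return Counter(post.get("category") for post in posts)


# Example usage, assuming posts.json sits in the current directory:
#     for category, total in count_by_category("posts.json").most_common():
#         print(category, total)
```

Posts whose category could not be extracted are counted under None.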

That's all!

Find the entire project on my git repo, naiblog.
