Last year, I created the IssueHunt-Statistics website, a project for tracking repositories, issues, and funding for open source projects.  Shortly after, however, the website it scraped changed and my project broke.  I updated the scraping code to restore functionality, only for it to break again a little while later.

I now have a problem.  I don’t want to keep reworking the scraping code every time the site changes.  Could I automate this task?

The Idea

The purpose of this adaptive web scraper is to derive a common structure that can be scraped without constantly rewriting code.  Let’s walk through an example.

Example

Suppose you wanted to parse the following HTML:

<html>
    <body>
        <div class="items">
            <div>
                <p class="name">Test1</p>
                <p class="isMet">Yes</p>
            </div>
            <div>
                <p class="name">Test2</p>
                <p class="isMet">No</p>
            </div>
            <div>
                <p class="name">Test3</p>
                <p class="isMet">Maybe</p>
            </div>
        </div>
    </body>
</html>

For this block of HTML, we want to derive a common structure for scraping.  To do that, we need to provide some data points for our program to find.  Without them, we could end up with an overly generalized structure.

In our example, we’ll provide “Test1” and “Test2” as our data references for the program.  Note that we can provide as much data as needed.

Feeding the data into the program, we grab the immediate tag that holds each data point.  We get the following results:

<p class="name">Test1</p>
<p class="name">Test2</p>
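
As a minimal sketch of this lookup step, the snippet below uses Python's standard xml.etree.ElementTree module.  This is an assumption made for illustration: the sample HTML happens to be well-formed enough for an XML parser, while real pages would usually need a more lenient HTML parser.  The helper name find_holders is hypothetical.

```python
import xml.etree.ElementTree as ET

HTML = """
<html>
    <body>
        <div class="items">
            <div>
                <p class="name">Test1</p>
                <p class="isMet">Yes</p>
            </div>
            <div>
                <p class="name">Test2</p>
                <p class="isMet">No</p>
            </div>
            <div>
                <p class="name">Test3</p>
                <p class="isMet">Maybe</p>
            </div>
        </div>
    </body>
</html>
"""


def find_holders(root, references):
    """For each reference string, return the first element whose own text matches it."""
    holders = []
    for ref in references:
        for el in root.iter():
            # el.text is the text directly inside the tag, before any child tag.
            if (el.text or "").strip() == ref:
                holders.append(el)
                break
    return holders


root = ET.fromstring(HTML.strip())
holders = find_holders(root, ["Test1", "Test2"])
for el in holders:
    # tostring() also emits trailing whitespace after the tag, so strip it.
    print(ET.tostring(el, encoding="unicode").strip())
```

Running it prints the two p tags shown above.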

As for deriving the structure, we can do this in two ways:

  1. Compare each of the tags for equality.  If they’re not equal, get the parent HTML and repeat until the structures are equal.
  2. Derive the basic structure such as tag name and attributes and determine if they’re the same.  If not, get the parent HTML and repeat until they share the same structure.

For this exercise, we’ll use the first method.

You might say that the results above are the common structure.  However, since we’re using the first method, these lines aren’t equal because their inner text differs.
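
To make the difference between the two comparisons concrete, here is a rough sketch of both, again assuming Python's xml.etree.ElementTree.  The function names strictly_equal and signature are hypothetical, and limiting the second method to the class attribute is an assumption made for the example.

```python
import xml.etree.ElementTree as ET


def strictly_equal(a, b):
    """Method 1: same tag, attributes, text and, recursively, the same children."""
    return (
        a.tag == b.tag
        and a.attrib == b.attrib
        and (a.text or "").strip() == (b.text or "").strip()
        and len(a) == len(b)
        and all(strictly_equal(x, y) for x, y in zip(a, b))
    )


def signature(el, keep=("class",)):
    """Method 2: reduce an element to its tag name, selected attributes and child signatures."""
    attrs = tuple((k, el.get(k)) for k in keep if el.get(k) is not None)
    return (el.tag, attrs, tuple(signature(child, keep) for child in el))


def structurally_equal(a, b):
    """Two elements share a structure when their signatures match; text is ignored."""
    return signature(a) == signature(b)


first = ET.fromstring('<p class="name">Test1</p>')
second = ET.fromstring('<p class="name">Test2</p>')

print(strictly_equal(first, second))      # False: the inner text differs
print(structurally_equal(first, second))  # True: same tag name and class
```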

Next, we get the parent of each tag and determine whether the two parents are equal:

<div>
    <p class="name">Test1</p>
    <p class="isMet">Yes</p>
</div>
<div>
    <p class="name">Test2</p>
    <p class="isMet">No</p>
</div>

In this case, it can be argued that there is a common structure.  Under the second method, these two would indeed count as the same structure (in fact, that method would already have stopped one level earlier, at the p tags).  However, since we’re using the first method, these aren’t equal.

Going up another level, we get the following:

<div class="items">
    <div>
        <p class="name">Test1</p>
        <p class="isMet">Yes</p>
    </div>
    <div>
        <p class="name">Test2</p>
        <p class="isMet">No</p>
    </div>
    <div>
        <p class="name">Test3</p>
        <p class="isMet">Maybe</p>
    </div>
</div>
<div class="items">
    <div>
        <p class="name">Test1</p>
        <p class="isMet">Yes</p>
    </div>
    <div>
        <p class="name">Test2</p>
        <p class="isMet">No</p>
    </div>
    <div>
        <p class="name">Test3</p>
        <p class="isMet">Maybe</p>
    </div>
</div>

Since these two are the same, we now have a common ancestor to derive a structure from.  If we only keep the tag name and the class, and collapse repeated children into a single entry, we’ll get the following structure:

{
    "name":"div",
    "attrs":{
        "class":["items"]
    },
    "children":[
        {
            "name":"div",
            "children":[
                {
                    "name":"p",
                    "attrs":{
                        "class":["name"]
                    }
                },
                {
                    "name":"p",
                    "attrs":{
                        "class":["isMet"]
                    }
                }
            ]
        }
    ]
}
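
The whole walk, from the reference text up to the structure above, could be sketched roughly as follows.  It assumes Python's xml.etree.ElementTree (fine for this well-formed sample, not for arbitrary HTML), and the names common_ancestor and derive_structure are hypothetical.  It is one possible reading of the steps described here, not the only one.

```python
import json
import xml.etree.ElementTree as ET

HTML = """
<html>
    <body>
        <div class="items">
            <div>
                <p class="name">Test1</p>
                <p class="isMet">Yes</p>
            </div>
            <div>
                <p class="name">Test2</p>
                <p class="isMet">No</p>
            </div>
            <div>
                <p class="name">Test3</p>
                <p class="isMet">Maybe</p>
            </div>
        </div>
    </body>
</html>
"""


def text_of(el):
    return (el.text or "").strip()


def strictly_equal(a, b):
    """Method 1: tag, attributes, text and all children must match."""
    return (a.tag == b.tag and a.attrib == b.attrib and text_of(a) == text_of(b)
            and len(a) == len(b) and all(strictly_equal(x, y) for x, y in zip(a, b)))


def common_ancestor(root, references):
    """Climb from the tags holding each reference until all candidates are equal."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    # next() raises StopIteration if a reference string is not in the document.
    current = [next(el for el in root.iter() if text_of(el) == ref) for ref in references]
    while not all(strictly_equal(current[0], other) for other in current[1:]):
        # The root has no parent and stays in place, so the loop always ends.
        current = [parents.get(el, el) for el in current]
    return current[0]


def derive_structure(el, keep=("class",)):
    """Keep only the tag name and selected attributes; collapse repeated children."""
    node = {"name": el.tag}
    attrs = {k: el.get(k).split() for k in keep if el.get(k) is not None}
    if attrs:
        node["attrs"] = attrs
    children = []
    for child in el:
        sub = derive_structure(child, keep)
        if sub not in children:
            children.append(sub)
    if children:
        node["children"] = children
    return node


root = ET.fromstring(HTML.strip())
structure = derive_structure(common_ancestor(root, ["Test1", "Test2"]))
print(json.dumps(structure, indent=4))
```

The three entry divs collapse into one child because their reduced structures are identical, which is why the output lists a single div under "children".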

We can now write web scraping code to get the data we need.
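
For instance, a hand-written scraper that follows the derived structure might look like the sketch below.  The path expression is written by hand from the structure, not generated, and the use of Python's xml.etree.ElementTree and the function name scrape are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

HTML = """
<html>
    <body>
        <div class="items">
            <div>
                <p class="name">Test1</p>
                <p class="isMet">Yes</p>
            </div>
            <div>
                <p class="name">Test2</p>
                <p class="isMet">No</p>
            </div>
            <div>
                <p class="name">Test3</p>
                <p class="isMet">Maybe</p>
            </div>
        </div>
    </body>
</html>
"""


def scrape(root):
    """Collect one dict per entry, keyed by the class of each p tag."""
    rows = []
    # Mirrors the derived structure: div.items -> div -> p.name / p.isMet
    for entry in root.findall(".//div[@class='items']/div"):
        rows.append({p.get("class"): (p.text or "").strip() for p in entry.findall("p")})
    return rows


root = ET.fromstring(HTML.strip())
for row in scrape(root):
    print(row)
```

Note that this picks up Test3 as well, even though only Test1 and Test2 were given as references.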

Considerations

Because there is no single way to build a website, all sorts of things can go wrong when attempting to derive a structure.  Here are some things to note.

HTML Attributes

In our example, each tag had at most one class.  If tags carried multiple classes that differed from one tag to the next, our structure would become more bloated.  Grabbing too many attributes, such as id and href, can make deriving a common structure pointless, since those values tend to be unique to each entry and we could end up describing every entry separately.  Thus, when determining the structure, it is best to stick to the minimum set of attributes needed to get a common structure.
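
A small sketch of the effect: the markup below adds hypothetical id attributes (not part of the original example) to two entries, and the hypothetical signature helper compares them with and without id.  It assumes Python's xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical variation of the example in which each entry carries a unique id.
items = ET.fromstring(
    '<div class="items">'
    '<div id="row-1"><p class="name">Test1</p></div>'
    '<div id="row-2"><p class="name">Test2</p></div>'
    '</div>'
)


def signature(el, keep):
    """Reduce an element to its tag name, the attributes listed in keep, and child signatures."""
    attrs = tuple((k, el.get(k)) for k in keep if el.get(k) is not None)
    return (el.tag, attrs, tuple(signature(child, keep) for child in el))


first, second = list(items)

# Keeping only class: both entries reduce to the same structure.
print(signature(first, ("class",)) == signature(second, ("class",)))              # True
# Keeping id as well: every entry becomes its own structure.
print(signature(first, ("class", "id")) == signature(second, ("class", "id")))    # False
```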

Going About Deriving

We can separate this into two concerns:

  1. The data provided to start the derivation.
  2. The method used to derive the structure.

The quality of the data points provided can strongly affect the structure that is returned.  If you don’t provide enough data, the algorithm might stop too early to derive a meaningful structure, or you’ll end up with a generalized structure that is too verbose to be of any use.  Thus, providing enough relevant data points is necessary for getting good structures.

Recall the two ways of deriving a structure mentioned earlier:

  1. Compare each of the tags for equality.  If they’re not equal, get the parent HTML and repeat until the structures are equal.
  2. Derive the basic structure such as tag name and attributes and determine if they’re the same.  If not, get the parent HTML and repeat until they share the same structure.

In our example, the data provided suggested that it was best to use the first method.  Had we used the second method, we would have stopped immediately at the tags holding our inner text.  Had we also provided data from tags with the class attribute “isMet”, the second method would have been able to derive the structure faster.
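
The sketch below compares the two methods on the same markup with different reference data and reports where each one stops.  It assumes Python's xml.etree.ElementTree and a condensed copy of the example's items block; the helper names climb, strictly_equal and structurally_equal are hypothetical.

```python
import xml.etree.ElementTree as ET

HTML = """<div class="items">
    <div><p class="name">Test1</p><p class="isMet">Yes</p></div>
    <div><p class="name">Test2</p><p class="isMet">No</p></div>
    <div><p class="name">Test3</p><p class="isMet">Maybe</p></div>
</div>"""


def text_of(el):
    return (el.text or "").strip()


def strictly_equal(a, b):
    """Method 1: tag, attributes, text and all children must match."""
    return (a.tag == b.tag and a.attrib == b.attrib and text_of(a) == text_of(b)
            and len(a) == len(b) and all(strictly_equal(x, y) for x, y in zip(a, b)))


def signature(el):
    """Method 2: tag name, class attribute and child signatures only."""
    return (el.tag, el.get("class"), tuple(signature(child) for child in el))


def structurally_equal(a, b):
    return signature(a) == signature(b)


def climb(root, references, same):
    """Return (tag, class, levels climbed) for the point where the candidates match."""
    parents = {child: parent for parent in root.iter() for child in parent}
    current = [next(el for el in root.iter() if text_of(el) == ref) for ref in references]
    levels = 0
    while not all(same(current[0], other) for other in current[1:]):
        current = [parents.get(el, el) for el in current]  # the root stays in place
        levels += 1
    return current[0].tag, current[0].get("class"), levels


root = ET.fromstring(HTML)
print(climb(root, ["Test1", "Test2"], strictly_equal))      # ('div', 'items', 2)
print(climb(root, ["Test1", "Test2"], structurally_equal))  # ('p', 'name', 0)
print(climb(root, ["Test1", "No"], structurally_equal))     # ('div', None, 1)
```

With only “name” references, the second method stops at the p tag itself.  Mixing in an “isMet” reference pushes it up one level to the entry div, which is the repeating unit we actually want to scrape.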

Adding Onto the Idea

Even if we can derive a structure, manually providing text for the program to work from just isn’t efficient.  What if we were visiting a website for the first time?  What if the data points we provided were no longer valid for referencing?  These questions would have to be addressed for the concept to be more useful.

Additionally, while it helps to know the structure of what we want to scrape, we ultimately want to generate code that will do the scraping.  How do we go about this?

I don’t have the answers right now, but with time, something interesting should come up.

I currently have a repository for this concept on my GitHub account if you want to check it out.