In my previous post, I talked about a proof of concept on developing a self-adapting web scraper.  As I was adding onto the project, I was having difficulty adding constraints for improving structure accuracy.  After some time, I came to one conclusion: My Initial Design Was Flawed!

Problem: Determining equality

The first issue with the design was how to determine the common structure.  In my original design, we would add the BeautifulSoup Tag structure to a set.  As each instance is evaluated, the Tag structure is added to the set and will only be added once if it’s unique.  If a single object is in the set after evaluating all instances, we found our structure and then generate a JSON structure based off of the HTML code.

With this design, the problem is that it does a literal comparison of everything.  This has the following effects:

  1. It can take longer to find the common structure.
  2. The structure can be much more bloated.
  3. The structure can also include HTML elements that are not relevant to the data references provided.

Solution: Generate a HTML Structure String Representation

Instead of comparing everything in hopes of getting a unique structure, we generate a string of tag names that represent the structure of the HTML code.  For example, the HTML code

<div class="instance">
    <p class="firstname">John</p>
    <p class="lastname">Doe</p>
</div>

will generate the string “divpp” as the output to be stored in a list/set.  This is repeated for each data reference instance until all strings are generated.  If there is only one item is in the list/set, then we found the structure.  This has a few benefits compared to the original design:

  1. By focusing on only tag names as a concatenated string, we can reduce the amount of memory needed for the comparison.
  2. We can derive a structure faster.
  3. By ignoring HTML tag attributes, we dramatically reducing structure complexity.

Problem: Knowing When To Stop Searching

Before find the common structure,  all the parents of each tag instance found are stored in a map.  In the original design, we would determine whether a structure was found by at each layer and placing them in the set.  If we use the HTML code example:

<div class="instance"> 
    <p class="firstname">John</p> 
    <p class="lastname">Doe</p> 
</div>
<div class="instance"> 
    <p class="firstname">Jane</p> 
    <p class="lastname">Roe</p> 
</div>

We could derive the following parent structures in the diagram below.  Our input is on the left, while our first and second passes go to the right.

The algorithm would stop comparing once we hit the box at the right.  This works well when all HTML elements are on the same level.  However, we end up hitting a problem when we start dealing with elements occurring at different levels.  Let’s take the following HTML code as an example:

<div class="instance">
    <div class="name">
        <p class="firstname">John</p> 
        <p class="lastname">Doe</p> 
    </div>
    <p class="gender>Male</p>
</div> 
<div class="instance"> 
    <div class="name">
        <p class="firstname">Jane</p> 
        <p class="lastname">Roe</p> 
    </div>
    <p class="gender>Female</p>
</div>

If we provide “Jane” and “Female” to our program, the algorithm would end up with the following steps:

In this case, the “Female” instance’s parent goes straight to the code on the right.  The “Jane” instance’s parent is the same as the parent from the previous example.  We reach the parent on the right on the second pass.  The problem in this case is that there the code might never be able to reach a common structure since one of them is always one parent farther up than the other.

Solution: Hash Table Data Structure For Parent Instances

In order to counter this problem, we store each parent instance in a map based off of the HTML structure string representation.  That way, if all instances have come across the same HTML structure, then we know that a common structure has been found.  An added bonus is that when a generate the JSON structure, we can just get the list from the hash table.

Problem: Unable to Simplify JSON Representation

Even if a HTML structure was found in the original design, correctly simplifying wasn’t always possible.  To generate the JSON representation, we would traverse the HTML code and write up each structure.  For children instances, if the returned structure was similar to a previous sibling instance, we wouldn’t add it.  However, if the attributes are different, it would treat the child instance as unique and add it to the parent’s children list.

Thus, generating child instances can cause the JSON structure representation to be bloated.

Solution: Redo Generating Algorithm

Due to how structures are changed, we would have to redo how the JSON structure was generated.  Here’s how the new algorithm worked:

  1. Create a new dictionary object (hash table object) with the tag name.
  2. If the tag has the class attribute, iterate through each child instance grabbing each class value and taking the intersection of all the class values.  If the set is not empty, then add the attribute to the dictionary object.
  3. If the HTML code contains children.  Do the following:
    1. Reorganize the child instances into a list that can be referenced by an index.  For example:
      <div class="instance">
          <div class="name"> 
              <p class="firstname">John</p> 
              <p class="lastname">Doe</p> 
          </div> 
      </div> 
      <div class="instance"> 
          <div class="name"> 
              <p class="firstname">Jane</p> 
              <p class="lastname">Roe</p> 
          </div> 
      </div>

      The list is created as:

      [[<div class=”name”><p class=”firstname”>John</p><p class=”lastname”>Doe</p></div>]
      [<div class=”name”><p class=”firstname”>Jane</p><p class=”lastname”>Roe</p></div> ]]

      When generating the child structure, the list is created as:

      [[<p class=”firstname”>John</p>,<p class=”lastname”>Doe</p>]
      [<p class=”firstname”>Jane</p>,<p class=”lastname”>Roe</p> ]]

    2.  Doing a numeric iteration, create another list that only refers to the ith child instances  and pass the list in the structure generating algorithm.  Repeat step 1.
    3. Add to the dictionary object.
  4. Return the structure.

Note that we can assume that all instances will have the exact same number of child instances.

Conclusion

As you grow as a software engineer under any specialization, you end up learning that a lot of your original designs on projects are rigid and ineffective.  I’m no exception as well.  The initial difficulties with AutoWeber forced me to rethink how the program was generating a representation of the HTML code.  With this follow-up, I was able to pinpoint the problems that I was hitting as well as provide solutions to overcome these problems.