Dhruv
Dhruv Gupta

An avid traveler with a nerdy streak!

HTML DOM Validator

Next.jsHTML DOMTestingPlayground

HTML DOM Validator can be used to verify structural integrity of HTML documents by validating them against specified rules. Using the HTML Validation Language, users can define custom rules to match their requirements.

Project Image

Contents


Overview


HTML DOM Validator can be used to verify the structural integrity of HTML documents by validating them against specified rules. Using the HTML Validation Language, users can define custom rules to match their requirements. The project provides a playground where users can input HTML code and define rules to validate the structure of the DOM. The validation algorithm parses the HTML document into a tree structure and matches it against the specified rules to identify any discrepancies. Detailed error messages are generated in case of mismatches, helping users identify and rectify issues in the HTML document. image

Approach


To create a generalized solution to match specific parts of an HTML DOM, I thought that if I could define a set of rules and then use these rules to validate the DOM structure of HTML documents, it would be a great solution. This would allow users to define custom rules to match their requirements and this is how I came up with the HTML Validation Language (HVL) and the validation algorithm. This validation process is divided into 2 steps:

1. Parsing

The HTML document is parsed into a tree structure that is easy to traverse and validate. This parsing is done in two steps and to understand the process, let's consider the following HTML code.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Document</title>
</head>
<body>
    <div id="main">
        <div id="body">
            <p id="mainpara" class="para para1"></p>
            <p class="para">a</p>
            <p class="para">a</p>
            <p class="para"> </p>
        </div>
    </div>
    <select name="" id="">
        <option value="">Option 1</option>
        <option value="">Option 2</option>
    </select>
</body>
</html>

Step 1: Flattening

The HTML code is first flattened to remove all the unnecessary information and every useful information is stored in the form of attributes. Indentation is used to denote nesting levels and the text of the element is stored in the text attribute. For simplicity, lets call this flattened form as HTML validation language (HVL). This translation is done using the Parser.js file which is located in src\Core\Helpers\.

The HVL for the above HTML code is as follows:

html(lang="en" text="")
    head( text="")
        meta(charset="UTF-8" text="")
        meta(name="viewport" content="width=device-width, initial-scale=1.0" text="")
        title( text="Document")
    body( text="")
        div(id="main" text="")
            div(id="body" text="")
                p(id="mainpara" class="para para1" text="")
                p(class="para" text="a")
                p(class="para" text="a")
                p(class="para" text="")
        select(name="" id="" text="")
            option(value="" text="Option 1")
            option(value="" text="Option 2")

Step 2: Parsing to tree

Now, this reduced structure is converted into a tree to efficiently represent children. An array data structure is used to store this tree where every element is an object with 3 keys

This reduction to tree is done using the General.js file which is located in src\Core\Helpers.

Translation of the above HVL to tree structure is as follows:

[
    ...
    {
        "tag": "div",
        "attributes": {
            "id": "body",
            "text": ""
        },
        "children": [8,9,10,11]
    },
    {
        "tag": "p",
        "attributes": {
            "id": "mainpara",
            "class": [
                "para",
                "para1"
            ],
            "text": ""
        },
        "children": []
    },
  ...
]

Now, intuitively, we can represent the rules in the flattened form / HVL and then use the tree to validate the rules.

2. Validation

The tree structure is used to validate the rules defined using the HTML Validation Language. The rules are defined in a simple syntax and are used to match the structure of the HTML document. Any number of rules can be defined that can either correspond to the entire document or a specific part of the document.

Rule Matching

HTML Validation Language(HVL)


The HTML Validation Language (HVL) allows you to define rules for validating the DOM structure of HTML documents. It provides a simple syntax to specify elements, attributes, and validation conditions. Below are the rules and syntax for using HVL.

1. Element Declaration

Specify the element name followed by its attributes in parentheses. If an element has no attributes, use empty parentheses.

elementName(attribute1="value1" attribute2="value2")

2. Nesting Levels

Use tab to denote nesting levels. Each level of indentation represents a child element. After all children in the Document Object Model (DOM) have been matched with the specified rule, any additional children present in the DOM will be disregarded.

div(id="body")
	h1(class="main_heading")
 
This represents that h1 with class main_heading is a children of div with id body.

3. Randomization of Element Order

For scenarios where the order of child elements is not essential, the random="true" attribute is employed within the element declaration to allow for the randomization of child element order.

elementName(random="true")
	childElement()
	childElement()
	childElement()

4. HREF Attribute Matching

To match href attribute starting with specific prefixes for URL validation, use hrefStartsWith:"{STARTING_PREFIX}".

elementName(hrefStartsWith:"http")

5. Text Matching with Regular Expressions

To match text within an element using regular expressions, use text="{REGEX_HERE}" , To match completely enter the entire string.

elementName(text="[abc]") // Matches any of the characters within the brackets.
elementName(text="Probem Of the Day") // Matches the entire string

6. Choice in HTML Tag

To provide user a choice between different HTML tags, use choice="tag1 tag2 ...".

elementName(choice="tag1 tag2 ...")

7. Match ID or Class with text of Kth Child

To match the ID or class of an element with the text of the K'th child, use matchIdK="K,$" or matchClassK="K,$" , where "$" is an operator that will replace the blank spaces in the text of K’th child.

elementName(matchIdK="K,$")
elementName(matchClassK="K,$")

Example Usage


Following are the rules to represent various aspects of the below HTML code.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Sample HTML Document</title>
</head>
<body>
    <div class="container" id="main">
        <h1>Welcome</h1>
        <p>This is a sample HTML document.</p>
        <div class="content">
            <h2>Content Section</h2>
            <p>This section contains some content.</p>
            <a class="link" href="https://example.com">Visit Example</a>
            <ul class="main_list" id="Item_3">
                <li>Item 1</li>
                <li>Item 2</li>
                <li>Item 3</li>
            </ul>
        </div>
    </div>
</body>
</html>
  1. Nested Elements

    This rule matches the <h1> and <p> elements nested inside the <div> with the class "container".

    div(class="container")
        h1()
        p()
  2. Randomization

    This rule matches <ul> element with the class "main_list" and 3 <li> as children such that their order is not significant.

    ul(class="main_list" random="true")
        li()
        li()
        li()
  3. HREF Attribute Matching

    This rule matches <a> element with class link whose href attribute must start with "https://".

    a(hrefStartsWith:"https://" class="link")
  4. Text Matching

    This rule validates that there must be atleast one <p> element in the DOM such that it has the text "sample HTML document” .

    p(text="sample HTML document")
  5. Choice in HTML Tag

    This rule matches an element with id "main" such that it can be either a <div> or a <section>.

    div(id="main" choice=”section”)
  6. Match ID or Class with text of Kth Child

    This rule matches an element <ul> with class main_list such that its id is same as the text of the 3rd child and replaces the blank spaces with "_".

    ul(matchIdK="3,_" class="main_list")

Local Setup and Development


This is a Next.js project bootstrapped with create-next-app.

Setting up the project

  1. Please make sure you have Node.js installed on your system.

  2. Clone the repository and navigate to the root directory of the project.

  3. Run the following command to install the required dependencies.

    npm install
    # or
    yarn install
  4. Run the following command to start the development server.

    npm run dev
    # or
    yarn dev
  5. Open http://localhost:3000 with your browser to see the result.

Project Structure

├──src
   ├──app
   ├──page.js
   ├──components
   ├──Navbar.js
   ├──Core
   ├──Helpers
   ├──Attribute.js (Attribute helper functions)
   ├──General.js (General helper functions)
   ├──Parser.js (HTML Parser)
   ├──Validator.js (Rule validation logic)

Made with ❤️ by Dhruv