Overview

HTML DOM Validator can be used to verify the structural integrity of HTML documents by validating them against specified rules. Using the HTML Validation Language, users can define custom rules to match their requirements. The project provides a playground where users can input HTML code and define rules to validate the structure of the DOM. The validation algorithm parses the HTML document into a tree structure and matches it against the specified rules to identify any discrepancies. Detailed error messages are generated in case of mismatches, helping users identify and rectify issues in the HTML document.

Approach

To create a generalized solution to match specific parts of an HTML DOM, I thought that if I could define a set of rules and then use these rules to validate the DOM structure of HTML documents, it would be a great solution. This would allow users to define custom rules to match their requirements and this is how I came up with the HTML Validation Language (HVL) and the validation algorithm. This validation process is divided into 2 steps:

1. Parsing

The HTML document is parsed into a tree structure that is easy to traverse and validate. This parsing is done in two steps and to understand the process, let's consider the following HTML code.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Document</title>
</head>
<body>
    <div id="main">
        <div id="body">
            <p id="mainpara" class="para para1"></p>
            <p class="para">a</p>
            <p class="para">a</p>
            <p class="para"> </p>
        </div>
    </div>
    <select name="" id="">
        <option value="">Option 1</option>
        <option value="">Option 2</option>
    </select>
</body>
</html>

Step 1: Flattening

The HTML code is first flattened to remove all the unnecessary information and every useful information is stored in the form of attributes. Indentation is used to denote nesting levels and the text of the element is stored in the text attribute. For simplicity, lets call this flattened form as HTML validation language (HVL). This translation is done using the Parser.js file which is located in src\Core\Helpers\.

The HVL for the above HTML code is as follows:

html(lang="en" text="")
    head( text="")
        meta(charset="UTF-8" text="")
        meta(name="viewport" content="width=device-width, initial-scale=1.0" text="")
        title( text="Document")
    body( text="")
        div(id="main" text="")
            div(id="body" text="")
                p(id="mainpara" class="para para1" text="")
                p(class="para" text="a")
                p(class="para" text="a")
                p(class="para" text="")
        select(name="" id="" text="")
            option(value="" text="Option 1")
            option(value="" text="Option 2")

Step 2: Parsing to tree

Now, this reduced structure is converted into a tree to efficiently represent children. An array data structure is used to store this tree where every element is an object with 3 keys

tag : The tag name of the element
attributes : The attributes of the element
children : The array indices of all the children of the element.

This reduction to tree is done using the General.js file which is located in src\Core\Helpers.

Translation of the above HVL to tree structure is as follows:

[
    ...
    {
        "tag": "div",
        "attributes": {
            "id": "body",
            "text": ""
        },
        "children": [8,9,10,11]
    },
    {
        "tag": "p",
        "attributes": {
            "id": "mainpara",
            "class": [
                "para",
                "para1"
            ],
            "text": ""
        },
        "children": []
    },
  ...
]

Now, intuitively, we can represent the rules in the flattened form / HVL and then use the tree to validate the rules.

2. Validation

The tree structure is used to validate the rules defined using the HTML Validation Language. The rules are defined in a simple syntax and are used to match the structure of the HTML document. Any number of rules can be defined that can either correspond to the entire document or a specific part of the document.

Rule Matching

For every rule, the first matching tag is found by traversing the elements array and then a recursive function is called to validate other HVL specific attributes and children. It is important to note that a pair of tag is said to be matching if the tag name and all the html attributes are same.
The recursive function is called for every child of the rule and the rule is said to be valid if all the children are valid.
For any mismatch, detailed error messages are generated.
The complete matching algorithm is implemented in the Validator.js file which is located in src\Core\.

HTML Validation Language(HVL)

The HTML Validation Language (HVL) allows you to define rules for validating the DOM structure of HTML documents. It provides a simple syntax to specify elements, attributes, and validation conditions. Below are the rules and syntax for using HVL.

1. Element Declaration

Specify the element name followed by its attributes in parentheses. If an element has no attributes, use empty parentheses.

elementName(attribute1="value1" attribute2="value2")

2. Nesting Levels

Use tab to denote nesting levels. Each level of indentation represents a child element. After all children in the Document Object Model (DOM) have been matched with the specified rule, any additional children present in the DOM will be disregarded.

div(id="body")
	h1(class="main_heading")
 
This represents that h1 with class main_heading is a children of div with id body.

3. Randomization of Element Order

For scenarios where the order of child elements is not essential, the random="true" attribute is employed within the element declaration to allow for the randomization of child element order.

elementName(random="true")
	childElement()
	childElement()
	childElement()

4. HREF Attribute Matching

To match href attribute starting with specific prefixes for URL validation, use hrefStartsWith:"{STARTING_PREFIX}".

elementName(hrefStartsWith:"http")

5. Text Matching with Regular Expressions

To match text within an element using regular expressions, use text="{REGEX_HERE}" , To match completely enter the entire string.

elementName(text="[abc]") // Matches any of the characters within the brackets.
elementName(text="Probem Of the Day") // Matches the entire string

6. Choice in HTML Tag

To provide user a choice between different HTML tags, use choice="tag1 tag2 ...".

elementName(choice="tag1 tag2 ...")

7. Match ID or Class with text of Kth Child

To match the ID or class of an element with the text of the K'th child, use matchIdK="K,$" or matchClassK="K,$" , where "$" is an operator that will replace the blank spaces in the text of K’th child.

elementName(matchIdK="K,$")
elementName(matchClassK="K,$")

Example Usage

Following are the rules to represent various aspects of the below HTML code.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Sample HTML Document</title>
</head>
<body>
    <div class="container" id="main">
        <h1>Welcome</h1>
        <p>This is a sample HTML document.</p>
        <div class="content">
            <h2>Content Section</h2>
            <p>This section contains some content.</p>
            <a class="link" href="https://example.com">Visit Example</a>
            <ul class="main_list" id="Item_3">
                <li>Item 1</li>
                <li>Item 2</li>
                <li>Item 3</li>
            </ul>
        </div>
    </div>
</body>
</html>

Nested Elements

This rule matches the <h1> and <p> elements nested inside the <div> with the class "container".
```
div(class="container")
    h1()
    p()
```
Randomization

This rule matches <ul> element with the class "main_list" and 3 <li> as children such that their order is not significant.
```
ul(class="main_list" random="true")
    li()
    li()
    li()
```
HREF Attribute Matching

This rule matches <a> element with class link whose href attribute must start with "https://".
```
a(hrefStartsWith:"https://" class="link")
```
Text Matching

This rule validates that there must be atleast one <p> element in the DOM such that it has the text "sample HTML document” .
```
p(text="sample HTML document")
```
Choice in HTML Tag

This rule matches an element with id "main" such that it can be either a <div> or a <section>.
```
div(id="main" choice=”section”)
```
Match ID or Class with text of Kth Child

This rule matches an element <ul> with class main_list such that its id is same as the text of the 3rd child and replaces the blank spaces with "_".
```
ul(matchIdK="3,_" class="main_list")
```

Local Setup and Development

This is a Next.js project bootstrapped with create-next-app.

Setting up the project

Please make sure you have Node.js installed on your system.
Clone the repository and navigate to the root directory of the project.
Run the following command to install the required dependencies.
```
npm install
# or
yarn install
```
Run the following command to start the development server.
```
npm run dev
# or
yarn dev
```
Open http://localhost:3000 with your browser to see the result.

Project Structure

├──src
│   ├──app
│   │   ├──page.js
│   ├──components
│   │   ├──Navbar.js
│   ├──Core
│   │   ├──Helpers
│   │   │   ├──Attribute.js (Attribute helper functions)
│   │   │   ├──General.js (General helper functions)
│   │   │   ├──Parser.js (HTML Parser)
│   │   ├──Validator.js (Rule validation logic)

HTML DOM Validator

Contents