Overview
HTML DOM Validator can be used to verify the structural integrity of HTML documents by validating them against specified rules. Using the HTML Validation Language, users can define custom rules to match their requirements. The project provides a playground where users can input HTML code and define rules to validate the structure of the DOM. The validation algorithm parses the HTML document into a tree structure and matches it against the specified rules to identify any discrepancies. Detailed error messages are generated in case of mismatches, helping users identify and rectify issues in the HTML document.
Approach
To build a general way of matching specific parts of an HTML DOM, I decided to define a set of rules and use them to validate the DOM structure of HTML documents. This lets users define custom rules to match their requirements, and it led to the HTML Validation Language (HVL) and the validation algorithm. The validation process is divided into 2 steps:
1. Parsing
The HTML document is parsed into a tree structure that is easy to traverse and validate. This parsing is done in two steps and to understand the process, let's consider the following HTML code.
Step 1: Flattening
The HTML code is first flattened to remove all unnecessary information, and all useful information is stored in the form of attributes. Indentation is used to denote nesting levels, and the text of the element is stored in the text attribute. For simplicity, let's call this flattened form the HTML Validation Language (HVL). This translation is done using the Parser.js file, which is located in src\Core\Helpers\.
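The flattening idea can be sketched as follows. This is a minimal illustration, not the code in Parser.js: the helper name `flatten` is hypothetical, the input is a plain nested object rather than an HTML string, and it assumes attributes are written as space-separated `key="value"` pairs inside the parentheses.

```javascript
// Hypothetical helper: flattens a nested element description into
// tab-indented lines of the form tag(key="value" ...).
// Assumption: attributes are separated by spaces inside the parentheses.
function flatten(node, depth = 0) {
  const attrs = { ...node.attributes };
  // The element's own text is stored as a `text` attribute.
  if (node.text) attrs.text = node.text;
  const attrText = Object.entries(attrs)
    .map(([key, value]) => `${key}="${value}"`)
    .join(" ");
  const line = "\t".repeat(depth) + `${node.tag}(${attrText})`;
  // Children are emitted one nesting level deeper.
  const childLines = (node.children || []).flatMap((child) =>
    flatten(child, depth + 1)
  );
  return [line, ...childLines];
}

const sampleNode = {
  tag: "div",
  attributes: { class: "container" },
  children: [{ tag: "h1", attributes: {}, text: "Hello" }],
};
const flatLines = flatten(sampleNode);
console.log(flatLines.join("\n"));
```

An element without attributes or text comes out with empty parentheses, which agrees with the element declaration rule of HVL.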
The HVL for the above HTML code is as follows:
Step 2: Parsing to tree
Now, this reduced structure is converted into a tree to efficiently represent children. An array data structure is used to store this tree, where every element is an object with 3 keys:
- tag: The tag name of the element
- attributes: The attributes of the element
- children: The array indices of all the children of the element.
This reduction to a tree is done using the General.js file, which is located in src\Core\Helpers.
Translation of the above HVL to tree structure is as follows:
Now, intuitively, we can represent the rules in the flattened form (HVL) and then use the tree to validate the rules.
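The conversion from tab-indented lines to the array-based tree can be sketched as below. The function names `parseLine` and `buildTree` are hypothetical and do not come from General.js, and the line format `tag(key="value" ...)` with space-separated attributes is an assumption.

```javascript
// Hypothetical helper: parses one line such as `div(class="container")`.
// Assumption: attributes are key="value" pairs inside the parentheses.
function parseLine(line) {
  const match = line.trim().match(/^([\w-]+)\((.*)\)$/);
  if (!match) throw new Error(`Cannot parse line: ${line}`);
  const attributes = {};
  for (const [, key, value] of match[2].matchAll(/([\w-]+)="([^"]*)"/g)) {
    attributes[key] = value;
  }
  return { tag: match[1], attributes, children: [] };
}

// Builds an array where every element has tag, attributes and children,
// with children holding the array indices of the child elements.
function buildTree(hvl) {
  const elements = [];
  const latestAtDepth = []; // latestAtDepth[d] = index of the last element seen at depth d
  for (const line of hvl.split("\n")) {
    if (line.trim() === "") continue;
    const depth = line.match(/^\t*/)[0].length; // number of leading tabs
    const index = elements.push(parseLine(line)) - 1;
    if (depth > 0) elements[latestAtDepth[depth - 1]].children.push(index);
    latestAtDepth[depth] = index;
  }
  return elements;
}

const tree = buildTree('div(class="container")\n\th1(text="Title")\n\tp(text="Body")');
console.log(tree[0].children); // [ 1, 2 ]
```

Storing child indices instead of nested objects keeps every element reachable by a single array lookup, which is what makes a linear scan for the first matching tag straightforward.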
2. Validation
The tree structure is used to validate the rules defined using the HTML Validation Language. The rules are defined in a simple syntax and are used to match the structure of the HTML document. Any number of rules can be defined that can either correspond to the entire document or a specific part of the document.
Rule Matching
- For every rule, the first matching tag is found by traversing the elements array, and then a recursive function is called to validate the other HVL-specific attributes and children. Note that a pair of tags is said to match if the tag name and all the HTML attributes are the same.
- The recursive function is called for every child of the rule, and the rule is said to be valid if all the children are valid.
- For any mismatch, detailed error messages are generated.
- The complete matching algorithm is implemented in the Validator.js file, which is located in src\Core\.
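The steps above can be sketched as follows. This is an illustration under stated assumptions, not the code in Validator.js: the names `sameTag`, `validate` and `matchRule` are hypothetical, the rule's root is assumed to sit at index 0, rule children are compared with DOM children by position, and a tag pair matches when every attribute named in the rule has the same value on the DOM element.

```javascript
// Assumption: a pair of tags matches when the names are equal and every
// attribute listed in the rule has the same value on the DOM element.
function sameTag(ruleEl, domEl) {
  if (ruleEl.tag !== domEl.tag) return false;
  return Object.entries(ruleEl.attributes).every(
    ([key, value]) => domEl.attributes[key] === value
  );
}

// Recursively validates the children of rule[ri] against those of dom[di].
// Assumption: children are compared by position; extra DOM children are ignored.
function validate(rule, ri, dom, di, errors) {
  const domChildren = dom[di].children;
  return rule[ri].children.every((rc, i) => {
    const dc = domChildren[i];
    if (dc === undefined || !sameTag(rule[rc], dom[dc])) {
      errors.push(`Expected <${rule[rc].tag}> as child ${i + 1} of <${dom[di].tag}>`);
      return false;
    }
    return validate(rule, rc, dom, dc, errors);
  });
}

// Finds the first DOM element matching the rule's root, then validates below it.
function matchRule(rule, dom) {
  const errors = [];
  const start = dom.findIndex((el) => sameTag(rule[0], el));
  if (start === -1) {
    errors.push(`No element matches <${rule[0].tag}>`);
    return { valid: false, errors };
  }
  return { valid: validate(rule, 0, dom, start, errors), errors };
}

const dom = [
  { tag: "body", attributes: {}, children: [1] },
  { tag: "div", attributes: { class: "container" }, children: [2, 3] },
  { tag: "h1", attributes: {}, children: [] },
  { tag: "p", attributes: {}, children: [] },
];
const rule = [
  { tag: "div", attributes: { class: "container" }, children: [1, 2] },
  { tag: "h1", attributes: {}, children: [] },
  { tag: "p", attributes: {}, children: [] },
];
console.log(matchRule(rule, dom).valid); // true
```

Collecting messages in an `errors` array rather than throwing lets the caller report why a rule failed, in the spirit of the detailed error messages described above.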
HTML Validation Language(HVL)
The HTML Validation Language (HVL) allows you to define rules for validating the DOM structure of HTML documents. It provides a simple syntax to specify elements, attributes, and validation conditions. Below are the rules and syntax for using HVL.
1. Element Declaration
Specify the element name followed by its attributes in parentheses. If an element has no attributes, use empty parentheses.
2. Nesting Levels
Use tabs to denote nesting levels. Each level of indentation represents a child element. Once all children specified in the rule have been matched, any additional children present in the Document Object Model (DOM) are disregarded.
3. Randomization of Element Order
For scenarios where the order of child elements is not essential, use the random="true" attribute within the element declaration to allow the child elements to appear in any order.
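One way to match children regardless of order is a small backtracking search, sketched below. The function `matchUnordered` and the `matches` predicate are hypothetical names used for illustration; the project may implement random ordering differently.

```javascript
// Sketch: decide whether every rule child can be paired with a distinct
// DOM child, in any order. `matches(ruleChild, domChild)` is a predicate
// supplied by the caller (hypothetical).
function matchUnordered(ruleChildren, domChildren, matches, used = new Set()) {
  if (ruleChildren.length === 0) return true; // every rule child has been placed
  const [first, ...rest] = ruleChildren;
  for (let i = 0; i < domChildren.length; i++) {
    if (used.has(i) || !matches(first, domChildren[i])) continue;
    used.add(i);
    if (matchUnordered(rest, domChildren, matches, used)) return true;
    used.delete(i); // backtrack and try another DOM child
  }
  return false;
}

const byTag = (ruleChild, domChild) => ruleChild.tag === domChild.tag;
const ruleKids = [{ tag: "li" }, { tag: "a" }];
const domKids = [{ tag: "a" }, { tag: "li" }, { tag: "span" }];
console.log(matchUnordered(ruleKids, domKids, byTag)); // true
```

Backtracking matters when one DOM child could satisfy several rule children: a greedy first-fit pairing can fail even though a valid assignment exists.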
4. HREF Attribute Matching
To match an href attribute that starts with a specific prefix for URL validation, use hrefStartsWith:"{STARTING_PREFIX}".
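The prefix check itself can be sketched in a few lines. The helper name `hrefHasPrefix` is hypothetical; the sketch assumes the element's attributes are available as a plain object.

```javascript
// Sketch: true when the element has an href that begins with the prefix.
function hrefHasPrefix(attributes, prefix) {
  return typeof attributes.href === "string" && attributes.href.startsWith(prefix);
}

console.log(hrefHasPrefix({ href: "https://example.com" }, "https://")); // true
console.log(hrefHasPrefix({ href: "http://example.com" }, "https://")); // false
```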
5. Text Matching with Regular Expressions
To match text within an element using regular expressions, use text="{REGEX_HERE}". To require a complete match, enter the entire string.
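A minimal sketch of the text check, assuming the pattern is applied with ordinary JavaScript regular expression search semantics, so it may match anywhere in the text unless anchored. The helper name `textMatches` is hypothetical.

```javascript
// Sketch: test an element's text against the pattern from the text attribute.
// Assumption: the pattern is searched for anywhere in the text.
function textMatches(text, pattern) {
  return new RegExp(pattern).test(text);
}

console.log(textMatches("This is a sample HTML document.", "sample HTML document")); // true
console.log(textMatches("This is a sample HTML document.", "^sample")); // false
```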
6. Choice in HTML Tag
To allow a choice between different HTML tags, use choice="tag1 tag2 ...".
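Checking a tag against the choice list can be sketched as below, assuming the alternatives are separated by whitespace as in the syntax above. The helper name `tagAllowed` is hypothetical.

```javascript
// Sketch: true when the element's tag is one of the space-separated choices.
function tagAllowed(tag, choice) {
  return choice.trim().split(/\s+/).includes(tag);
}

console.log(tagAllowed("section", "div section")); // true
console.log(tagAllowed("span", "div section")); // false
```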
7. Match ID or Class with text of Kth Child
To match the ID or class of an element with the text of the Kth child, use matchIdK="K,$" or matchClassK="K,$", where "$" is an operator that replaces the blank spaces in the text of the Kth child.
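The comparison for matchIdK can be sketched as follows. The helper name `matchesIdK` is hypothetical, and the sketch assumes that K is 1-based and that the texts of the children are already available as an array of strings.

```javascript
// Sketch: compare an element's id with the text of its Kth child after
// replacing blank spaces with the given operator.
// Assumptions: K is 1-based; childTexts[i] is the text of child i + 1.
function matchesIdK(attributes, childTexts, spec) {
  const [k, operator] = spec.split(",");
  const text = childTexts[Number(k) - 1];
  if (text === undefined) return false;
  // split/join replaces every space without treating the operator
  // as a special replacement pattern.
  return attributes.id === text.split(" ").join(operator);
}

const childTexts = ["Home", "About", "Contact us"];
console.log(matchesIdK({ id: "Contact_us" }, childTexts, "3,_")); // true
```

The same check would apply to matchClassK with the class attribute in place of the id.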
Example Usage
The following rules represent various aspects of the HTML code below.
- Nested Elements
  This rule matches the <h1> and <p> elements nested inside the <div> with the class "container".
- Randomization
  This rule matches a <ul> element with the class "main_list" and 3 <li> children whose order is not significant.
- HREF Attribute Matching
  This rule matches an <a> element with the class "link" whose href attribute must start with "https://".
- Text Matching
  This rule validates that there is at least one <p> element in the DOM that has the text "sample HTML document".
- Choice in HTML Tag
  This rule matches an element with the id "main" that can be either a <div> or a <section>.
- Match ID or Class with text of Kth Child
  This rule matches a <ul> element with the class "main_list" whose id is the same as the text of the 3rd child, with the blank spaces replaced by "_".
Local Setup and Development
This is a Next.js project bootstrapped with create-next-app.
Setting up the project
- Please make sure you have Node.js installed on your system.
- Clone the repository and navigate to the root directory of the project.
- Run the following command to install the required dependencies.
- Run the following command to start the development server.
- Open http://localhost:3000 with your browser to see the result.