Question:

In HTML,

<a HREF="http://...... & .... ">Dust & Bones</a>

needs to be escaped as follows:

<a href="http://...... &amp; .... ">Dust &amp; Bones</a>

What is the scope in which &amp; needs to be applied? Is it just href, or anywhere within HTML text? What about

<input value="http://... & ">?

or within

<script>... & ... </script>

Do these need escaping?


Update:

The bigger question, which would explain this, is: when does the HTML parser look for &XXX; tokens and replace them? Is it done once on the whole document, or do different rules apply for the text between tags vs. attribute values within a tag vs. within tagA vs. within tagB? Different parsing rules seem to apply within <script>, so I may write && (for AND) and < (for LESS-THAN). So, what rules apply in which scopes?

Answer:

The rules vary depending on the version of HTML you are dealing with, but they are always more complex than is worth trying to remember.

The safe approach is "Use character references to represent the 5 HTML special characters everywhere except inside script and style elements", which makes you safe for everything except XHTML.
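As a minimal sketch of that rule, assuming the markup is generated from Python: the standard library's `html.escape` replaces all five special characters when `quote=True` (the default). The `link` helper and the example.com URL are hypothetical, chosen only for illustration.

```python
from html import escape


def link(url: str, label: str) -> str:
    """Build an <a> element, escaping both the attribute value and the text."""
    # quote=True also escapes " and ', which matters inside attribute values.
    return f'<a href="{escape(url, quote=True)}">{escape(label)}</a>'


anchor = link("http://example.com/?a=1&b=2", "Dust & Bones")
print(anchor)

# All five special characters are replaced by character references.
print(escape("<>&\"'"))
```

The same call is used for the attribute value and for the text between the tags, which matches the "everywhere except script and style" advice.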

For XHTML the rule is the same with the additional proviso of "and use explicit CDATA sections in script and style elements".
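To see why XHTML needs the extra proviso, here is a rough sketch that uses Python's `xml.etree.ElementTree` as a stand-in for an XML parser. The script body and the `run` function in it are made up for the example.

```python
import xml.etree.ElementTree as ET

# Script content wrapped in an explicit CDATA section: well-formed XML.
wrapped = "<script><![CDATA[\nif (a < b && c) { run(); }\n]]></script>"
# The same content without the CDATA section: a bare < and & are not allowed.
bare = "<script>if (a < b && c) { run(); }</script>"

# The XML parser hands back the raw characters inside the CDATA section.
script_text = ET.fromstring(wrapped).text

try:
    ET.fromstring(bare)
    bare_is_well_formed = True
except ET.ParseError:
    bare_is_well_formed = False

print(repr(script_text))
print(bare_is_well_formed)
```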


The bigger question, which would explain this, is, when does the HTML parser look for &XXX; tokens and replace them?

As it parses the HTML, depending on the current state of the tokeniser ("inside start tag" and "inside attribute value" are examples of different states).
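The state-dependent behaviour can be observed with Python's `html.parser.HTMLParser`, used here only as an approximation of what a browser does. The `Collector` class, the sample document, and the example.com URL are assumptions made for this sketch.

```python
from html.parser import HTMLParser


class Collector(HTMLParser):
    """Record attribute values, ordinary text, and script content separately."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.attrs = {}
        self.text = []
        self.script = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        for name, value in attrs:
            self.attrs[(tag, name)] = value

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        (self.script if self.in_script else self.text).append(data)


doc = (
    '<a href="http://example.com/?a=1&amp;b=2">Dust &amp; Bones</a>'
    "<script>if (a &amp;&amp; b) {}</script>"
)
parser = Collector()
parser.feed(doc)
parser.close()

href = parser.attrs[("a", "href")]   # character reference decoded
text = "".join(parser.text)          # character reference decoded
script = "".join(parser.script)      # left untouched inside <script>
print(href, text, script, sep="\n")
```

The reference is decoded in the attribute value and in the text, but the content of the script element is passed through as written.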

Is it done once on the whole document

Yes, unless you trigger additional HTML parsing (e.g. by setting innerHTML on an element).
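Each parsing pass decodes one level of character references, so text that gets parsed a second time loses a second level. A loose analogy in Python, with `html.unescape` standing in for a single parsing pass (it is not what a browser runs):

```python
from html import unescape

source = "&amp;lt;"            # what was written in the document

first_pass = unescape(source)       # after the initial parse
second_pass = unescape(first_pass)  # after a further parse of that result

print(first_pass)
print(second_pass)
```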

or do different rules apply for the text between tags vs. attribute values within a tag vs. within tagA vs. within tagB

Different rules apply in different places. The complete, current rules are (as I suggested in a comment) rather complex and would require a lot of work to extract from the HTML 5 parsing rules. This is why I suggest, if you are an HTML author and not a browser author, using the simpler rules of "Use character references unless you are in a script or style element".

-- different parsing rules seem to apply within <script>, so I may write && (for AND) and < (for LESS-THAN). So, what rules apply in which scopes?

In HTML 4 terms, script and style elements are defined as containing CDATA (where the only sequence of characters with special meaning in HTML is </, which terminates the CDATA section). Everywhere else in the document (including, counter-intuitively, attribute values that are defined as containing CDATA), & indicates the start of a character reference (although there might be a few exceptions based on which character follows the &).
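Because `</` is the one sequence that matters inside a script element, data embedded there has to avoid it by some means other than character references. A minimal sketch, assuming the data is JSON produced from Python; `embed_json` is a hypothetical helper, and it covers only the `</` case described above.

```python
import json


def embed_json(value) -> str:
    """Serialise value as JSON that is safe to place inside <script>...</script>."""
    # JSON allows "\/" as an escape for "/", so "<\/" decodes back to "</"
    # while no longer looking like an end tag to the HTML parser.
    return json.dumps(value).replace("</", "<\\/")


payload = embed_json("</script><b>Dust & Bones</b>")
print(payload)

# The & is left alone: character references are not decoded in script content.
round_trip = json.loads(payload)
```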

The HTML 5 rules are more complicated, but the basic principle of "It is safe and sane to use character references for &, <, >, " and ' everywhere except inside script and style elements" holds.
