Codepath

PHP Encoding for HTML

Sanitizing output for HTML

HTML can contain almost any character. There are a few characters which have special meanings in HTML and should be used with caution.

Reserved HTML characters

<  >  &  "

These characters do not have special meanings in all parts of HTML. For example, it is not a problem to use " inside a paragraph of text, but it is a problem to use it inside an HTML attribute value.

<?php
  $title = 'Movie: "Star Wars"';
?>

<p title="<?php echo $title ?>">I love the movie "Star Wars".</p>

In the example above, the " is a problem for the title attribute but not for the content of the paragraph. The resulting HTML is shown below. The browser would read the title value as "Movie: " and treat the rest as stray attribute text.

<p title="Movie: "Star Wars"">I love the movie "Star Wars".</p>

The < and > are the most problematic characters because they indicate the start and end of tags. Using them accidentally could open a new tag or close an existing tag when it is not intended and break the entire page structure.

Allowing these characters in dynamic data also opens up the possibility that additional tags, including form elements and JavaScript, could be inserted in the HTML of the page. This is a common security vulnerability that attackers exploit.


HTML encoding

All dynamic values should be encoded (i.e. "transformed") before being used anywhere in HTML. This will ensure that the content does not interfere with the structure of the HTML. Web developers output a lot of dynamic data to HTML, so HTML encoding happens routinely.

Encoding is a major security concern because injected JavaScript relies on HTML tags or attributes to run. HTML encoding is a primary defense against Cross-Site Scripting (XSS) attacks.

There are different types of encoding depending on the context. Encoding for HTML means converting reserved characters into HTML character entities.

HTML character entities are written as &code;, where "code" is an abbreviation or a number to represent each character. There are thousands of HTML character entities, but for encoding special characters, there are only four that matter.

char entity
< &lt;
> &gt;
& &amp;
" &quot;
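
To make the table concrete, here is a minimal sketch that applies the four replacements by hand with strtr(). The function name encode_reserved() is hypothetical and exists only for illustration; real code should rely on PHP's built-in encoding functions.

```php
<?php
// Hypothetical helper for illustration only: maps each reserved
// character to its HTML character entity, as in the table above.
function encode_reserved(string $value): string {
  // strtr() with an array replaces every character in a single pass,
  // so the & in "&lt;" is never encoded a second time.
  return strtr($value, [
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
  ]);
}

$title = 'Movie: "Star Wars"';
echo '<p title="' . encode_reserved($title) . '">I love the movie.</p>', "\n";
// <p title="Movie: &quot;Star Wars&quot;">I love the movie.</p>
```

The browser converts each entity back into the original character when it displays the value, so the reader still sees the quotation marks.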

PHP encoding functions

PHP has two built-in functions which can help with HTML encoding. The first encodes only the four reserved characters. The second encodes as much as it can.

htmlspecialchars()

  • Encode reserved characters as HTML entities
  • Ignores single quotes by default before PHP 8.1 (pass ENT_QUOTES to encode them); PHP 8.1 and later encode them by default
  • Use for all output inside HTML

htmlentities()

  • Encode every character which has a named HTML entity
  • Use for safe and pretty output in HTML

Example:

<?php $string = 'We have to watch out for < and & as well as " and >'; ?>
<p><?php echo htmlspecialchars($string); ?></p>
Would output:
<p>We have to watch out for &lt; and &amp; as well as &quot; and &gt;</p>

Example:

<?php $symbols = "™ ® © • £ ¢ ¥"; ?>
<p><?php echo 'Symbols: ' . htmlentities($symbols); ?></p>
Would output:
<p>Symbols: &trade; &reg; &copy; &bull; &pound; &cent; &yen;</p>

There are PHP functions which can decode these encoded strings (htmlspecialchars_decode(), html_entity_decode()) but they are almost never needed because the browser does the decoding that matters when it processes the HTML page.
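
For the rare cases where decoding is needed in PHP, the following is a small sketch of a round trip. The sample strings are made up for illustration, and the last line assumes the default UTF-8 character set.

```php
<?php
$original = 'Tom & Jerry <3 "cartoons"';

// Encode once for output in HTML.
$encoded = htmlspecialchars($original);
echo $encoded, "\n";
// Tom &amp; Jerry &lt;3 &quot;cartoons&quot;

// Decoding reverses the encoding and returns the original text.
$decoded = htmlspecialchars_decode($encoded);
var_dump($decoded === $original);
// bool(true)

// html_entity_decode() also understands named entities such as &copy;
$symbols = html_entity_decode('&copy; &trade;');
echo $symbols, "\n";
```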


Pro Tip

Because encoding for HTML is done frequently and because the function name is very long, many PHP developers define a custom function as a shortcut.

<?php
  function h($string="") {
    return htmlspecialchars($string);
  }
?>

<?php echo h("This is safe for < and >."); ?>

Much easier!
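
One possible refinement, offered as a sketch rather than a recommendation from the original guide: pass the flags and character set explicitly so that single quotes are encoded on every PHP version. The name hq() is hypothetical, and UTF-8 is an assumption about the page's character set.

```php
<?php
// Hypothetical variant of h(): the flags and character set are passed
// explicitly, so single quotes are encoded on every PHP version.
function hq($string = ""): string {
  // The (string) cast turns null into "" before encoding.
  return htmlspecialchars((string) $string, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

$name = "O'Reilly <admin>";
// Safe even though the attribute is delimited by single quotes.
echo "<input type='text' value='" . hq($name) . "'>", "\n";
// <input type='text' value='O&#039;Reilly &lt;admin&gt;'>
```

Without ENT_QUOTES, older PHP versions would leave the apostrophe untouched and it would end the attribute value early.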


Encoding for URLs inside HTML

When outputting a dynamic link to an HTML page, it should be encoded for the URL first and then encoded for HTML, because all output should be encoded for HTML.

Example:

<?php
  $course = 'web security';
  $query = 'URL encode & decode';
  $label = 'Link label with < and >';

  // Encode only the dynamic segment; encoding the whole path would also encode the slashes.
  $url = '/courses/' . rawurlencode($course) . '/content';
  $url .= '?search=' . urlencode($query);
?>

<a href="<?php echo htmlspecialchars($url); ?>">
  <?php echo htmlspecialchars($label); ?>
</a>
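
The HTML step matters most when a URL has more than one query parameter. The sketch below uses http_build_query() to build the query string; the page parameter is a hypothetical addition to the example above.

```php
<?php
// Hypothetical values; any user-supplied strings could take their place.
$course = 'web security';
$params = ['search' => 'URL encode & decode', 'page' => 2];

// Encode the path segment and the query string for the URL first.
$url = '/courses/' . rawurlencode($course) . '/content';
$url .= '?' . http_build_query($params, '', '&');
echo $url, "\n";
// /courses/web%20security/content?search=URL+encode+%26+decode&page=2

// Then encode the whole URL for HTML: the & separator becomes &amp;
$href = htmlspecialchars($url);
echo $href, "\n";
// /courses/web%20security/content?search=URL+encode+%26+decode&amp;page=2
```

The & inside the search value was already turned into %26 by the URL encoding, so only the separator between parameters is changed by the HTML encoding.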

Other HTML sanitizing functions

strip_tags()

PHP's strip_tags() function will remove all HTML and PHP tags from a string. It is an exception to the "don't remove content" rule because it is well-designed to remove all tags.

<?php
  $string = '<p>Text</p><!-- Comment --><a href="link.php">Link</a>';
  echo strip_tags($string);
  // TextLink
?>

It is possible to whitelist tags which should still be allowed but, as the PHP manual notes, this opens up the possibility for abuse of the tag attributes such as style and onmouseover.

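When removing all tags, strip_tags() prevents new tags from being inserted into the page. Unlike htmlspecialchars(), it does not encode quotes or ampersands, so its output still needs encoding before being used in an attribute value. When tags are allowed, strip_tags() is much less secure than htmlspecialchars() and should be used with caution.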
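
A short sketch of the risk with allowed tags, using a made-up input string:

```php
<?php
$input = '<p onmouseover="alert(1)">Hello</p><script>alert(2)</script>';

// With no allowed tags, every tag is removed. The text between
// the tags, including the script body, is kept.
$plain = strip_tags($input);
echo $plain, "\n";
// Helloalert(2)

// Allowing <p> keeps the whole tag, including its event-handler attribute.
$partial = strip_tags($input, '<p>');
echo $partial, "\n";
// <p onmouseover="alert(1)">Hello</p>alert(2)
```

In the second call the onmouseover attribute survives, so the JavaScript would still run when a visitor moves the pointer over the paragraph.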

filter_var()

PHP's filter_var() function will apply a selected filter to a value. Filters are grouped into sanitizing and validating. At first, filters may seem harder to use than simple functions, but they are powerful.

According to the PHP manual, the FILTER_SANITIZE_FULL_SPECIAL_CHARS filter is equivalent to calling htmlspecialchars() with ENT_QUOTES set.

<?php
  $string = 'We have to watch out for < and & as well as " and >';
  echo filter_var($string, FILTER_SANITIZE_FULL_SPECIAL_CHARS);
?>

Other sanitizing filters include:

  • FILTER_SANITIZE_ENCODED: encodes for a URL, like rawurlencode()
  • FILTER_SANITIZE_URL: removes all characters not allowed in a URL
  • FILTER_SANITIZE_EMAIL: removes characters not allowed in an email address
  • FILTER_SANITIZE_STRING: removes tags, like strip_tags() (deprecated as of PHP 8.1)
  • FILTER_SANITIZE_NUMBER_INT: removes all characters except digits and the + and - signs
  • FILTER_SANITIZE_NUMBER_FLOAT: removes all characters except digits and the + and - signs; flags can also allow a decimal point, thousands separator, or exponent
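
A brief sketch of three of these filters in use. The input values are invented for illustration.

```php
<?php
// Removes characters which are not allowed in an email address.
$email = filter_var('john(.doe)@exa//mple.com', FILTER_SANITIZE_EMAIL);
echo $email, "\n";
// john.doe@example.com

// Keeps only digits and the + and - signs.
$int = filter_var('abc-12x3', FILTER_SANITIZE_NUMBER_INT);
echo $int, "\n";
// -123

// The decimal point survives only when FILTER_FLAG_ALLOW_FRACTION is set.
$float = filter_var('$1,234.50', FILTER_SANITIZE_NUMBER_FLOAT, FILTER_FLAG_ALLOW_FRACTION);
echo $float, "\n";
// 1234.50
```

Sanitizing only removes characters. It does not confirm that the result is a valid email address or number, which is the job of the validating filters.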

Pro Tip

If the filter_var() syntax seems cumbersome, it is possible to wrap the calls in custom functions with names which are easier to remember.

<?php
  function sanitize_email($value="") {
    return filter_var($value, FILTER_SANITIZE_EMAIL);
  }
?>