Encoding issues in HTML documents

Today I came across some problems while creating French content for a new website. French includes accented characters that only display correctly if the web page's charset matches the file's encoding (utf-8 in my case), or if they are converted to the appropriate HTML entities (see this excellent list of HTML entities for French sites).

I edit my HTML files using Vim on Windows. I have set up Vim so that it can edit utf-8 encoded files (including Chinese characters) and save them as such. So I wrote the accented characters directly in the HTML code, without using HTML entities. When I uploaded the pages to the site, Firefox displayed question marks inside black diamonds wherever these characters should have been.

I tried different solutions, but one of them worked particularly well in my case:

The site in question is Joseph SARL, a simple shopfront site. I built it using a templating system I wrote in PHP 5, which I have also used for other small sites. All pages are called from index.php, which then includes the requested page (e.g. index.php?page=contact). The idea was to use a couple of PHP functions to convert the special characters to HTML entities automatically. So, instead of including the HTML file, I decided to fopen() it, fread() it, and pass its contents through the htmlentities() PHP function, as follows:

<?php
class template {
    public function printPage() {
        $handle = fopen($this->page, "r");
        $contents = fread($handle, filesize($this->page));
        return htmlentities($contents);
    }
}
?>

The problem is that this also converts the characters which are part of the HTML tags, such as '<' and '>', so the browser shows the entire HTML source as one paragraph of text. I only want the accents and other special French characters to be converted. I did not find a PHP function that does exactly that, but there is one that decodes only the entities for the markup characters themselves (such as &lt;, &gt;, &amp; and &quot;): htmlspecialchars_decode. I just need to apply it to the output of htmlentities():

<?php
class template {
    public function printPage() {
        $handle = fopen($this->page, "r");
        $contents = fread($handle, filesize($this->page));
        return htmlspecialchars_decode(htmlentities($contents));
    }
}
?>
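
The round trip can be tried on a short string. This is only a minimal sketch: the function name encodeAccentsOnly is hypothetical, and passing ENT_QUOTES and 'UTF-8' explicitly assumes the input really is utf-8 encoded (older PHP 5 releases default to ISO-8859-1 when no charset is given).

```php
<?php
// Hypothetical helper, not part of the original template class.
// Assumes $html is a UTF-8 encoded string.
function encodeAccentsOnly($html) {
    // Step 1: turn every character that has a named entity into that entity,
    // including the markup characters <, >, & and the quotes.
    $encoded = htmlentities($html, ENT_QUOTES, 'UTF-8');
    // Step 2: turn only the markup entities back into real characters,
    // leaving entities such as &eacute; untouched.
    return htmlspecialchars_decode($encoded, ENT_QUOTES);
}

echo encodeAccentsOnly('<p class="intro">Café à la carte</p>');
// prints: <p class="intro">Caf&eacute; &agrave; la carte</p>
```

The tags and attribute quotes survive unchanged, while the accented letters come out as entities.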

Excellent! This converts only the non-HTML special characters into HTML entities, and my page no longer depends on the page encoding or browser settings. Only one more problem: some of my included pages have PHP code in them, and this code now appears as plain text because it is no longer parsed by the PHP engine. The solution is the eval() statement. The plan is this:
  1. Convert the French characters to HTML entities
  2. Parse the page for PHP code
  3. Capture the result and send it to the browser

The only problem is that eval() expects to be given PHP code, not plain HTML with interspersed PHP tags. If I just feed it a plain HTML page, it will try to parse it as PHP code and fail with parse errors. The trick is that eval() behaves as if PHP tags were automatically added around the passed code:

<?php
    eval('echo "test";');
    // Equivalent to a file containing: <?php echo "test";
?>

So, in theory, all I need to do is close the PHP tag at the beginning of my string, and re-open it just before the end. Let's try this:

<?php
class template {
    public function printPage() {
        $handle = fopen($this->page, "r");
        $contents = fread($handle, filesize($this->page));
        // eval() echoes the page output directly; it returns NULL unless the evaluated code calls return
        $contents = eval("?>" . htmlspecialchars_decode(htmlentities($contents)) . "<?php ;");
        return $contents;
    }
}
?>

Result: Bingo! It works like a charm. This would be very dangerous if the page contained dynamic elements such as the contents of a database, or posts in a forum. But this is a static site where I control all the content, so it's pretty safe.
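
Step 3 of the plan, capturing the result, can be sketched with output buffering, because eval() prints whatever the page outputs and returns NULL unless the evaluated code itself calls return. This is a minimal sketch and not the code used on the site: the function name renderToString is hypothetical, and the source string is assumed to be trusted content.

```php
<?php
// Hypothetical helper: run a string of mixed HTML/PHP and return its output.
// Only safe for content you fully control, since any PHP in $source is executed.
function renderToString($source) {
    ob_start();                // start capturing everything that gets printed
    eval('?>' . $source);      // the leading closing tag makes eval() start in HTML mode
    return ob_get_clean();     // stop capturing and hand back the buffer as a string
}

echo renderToString('<p>2 + 3 = <?php echo 2 + 3; ?></p>');
// prints: <p>2 + 3 = 5</p>
```

With the output held in a string, the caller can decide when to send it to the browser, for example after wrapping it in the site layout.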