Weird Characters after cut-and-paste

August 22, 2011

A teacher asked about “Weird letter characters appearing when viewing [her] product description online.”

This definitely looks like a “character set” issue, which often happens when someone “cuts and pastes” from a software application that uses one character set into another application which uses a different character set.

This is rarely an issue for ASCII or “regular typewriter” characters, which map identically across most Western character sets you’re likely to encounter, but it’s definitely a problem for less common characters, including quotation marks [“ ”], accented characters [ñ à ë î], fraction symbols [¼], and more [™ ®].
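To make that concrete, here is a minimal Python sketch that encodes the same text in a few character sets and compares the resulting bytes. The function name and the choice of encodings are my own assumptions for illustration; the codec names themselves are standard Python ones.

```python
# Character sets to compare. "cp1252" is Python's name for windows-1252.
ENCODINGS = ["ascii", "cp1252", "latin-1", "utf-8"]


def bytes_by_encoding(text):
    """Return {encoding: bytes}, or None where the text cannot be encoded."""
    result = {}
    for name in ENCODINGS:
        try:
            result[name] = text.encode(name)
        except UnicodeEncodeError:
            result[name] = None
    return result


plain = bytes_by_encoding("Four score")
accented = bytes_by_encoding("\u00f1")  # the letter n with a tilde

# Plain typewriter characters give identical bytes in every character set.
print(len(set(plain.values())) == 1)

# The accented letter does not: ASCII cannot store it at all, the two
# single-byte sets use one byte, and UTF-8 uses two different bytes.
print(accented)
```

The second line of output shows why a program that guesses the wrong character set displays the wrong symbol: the same letter is stored as different bytes.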

But even “plain text” might contain embedded “hidden” characters or might use character variations that aren’t visible to you but which aren’t part of the basic ASCII character set. For example, did you know that there are many different variations of the space character (wikipedia), including a “thin space,” a “hair space,” and an oxymoron called a zero-width space?
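If you want to see which of these space variations are hiding in a piece of text, a short Python sketch like the following can list them. It relies on the standard unicodedata module; the helper names (describe, find_unusual_spaces) are hypothetical ones chosen for this example.

```python
import unicodedata

# A plain space, a no-break space, a thin space, a hair space,
# and a zero-width space.
SAMPLES = ["\u0020", "\u00a0", "\u2009", "\u200a", "\u200b"]


def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch, "UNKNOWN"))


def find_unusual_spaces(text):
    """List (position, description) for space-like characters other than
    the ordinary ASCII space."""
    hits = []
    for i, ch in enumerate(text):
        # "Zs" is the Unicode category for space separators. The zero-width
        # space is in a different category, so it is checked by hand.
        if ch != " " and (unicodedata.category(ch) == "Zs" or ch == "\u200b"):
            hits.append((i, describe(ch)))
    return hits


for ch in SAMPLES:
    print(describe(ch))
```

Calling find_unusual_spaces on text pasted from a word processor reports where any of these invisible variations occur, so they can be replaced before posting.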

Some software also embeds normally invisible codes (to signify bold or italic text, for example). When that text is “cut and pasted” into another program that doesn’t recognize these codes, they appear as “weird characters” instead.

Quotation marks are a special case, because there are several different symbols used to represent quotation marks. (wikipedia)

I constantly have problems when I use Microsoft Word to edit text that I’ll later need to paste into another application, because by default Microsoft Word applies “smart quotes,” converting regular quotation marks (which map into nearly all character sets) into “opening” and “closing” quotation marks (which often map to other characters, including the fraction symbols I see in your text). [Sometimes these distinct quotation marks are referred to as “curly quotes,” but they usually appear curly only in a “serif” font; they’re usually “straight but at an angle” in a sans-serif font.]

WordPress (blog software) is even more troublesome: it stores most quotation marks internally as standard “vertical” quotation marks, but then when displaying text, it applies “smart quotes” so that opening and closing quotation marks are seen instead.  It also will sometimes transform standard quotation marks into opening and closing (left and right) quotation marks.

And although Windows Notepad (for example) doesn’t convert quotation marks into opening and closing versions, if I paste text from Microsoft Word or WordPress, the variant quotation marks remain in Notepad.
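Because the variant quotation marks survive a trip through a plain-text editor, one option is to convert them back to regular quotation marks before pasting. This is only a sketch of that idea in Python, not a feature of Word, WordPress, or Notepad; the function name straighten_quotes is made up for the example.

```python
# Map the four "smart" quotation marks to their plain ASCII equivalents.
STRAIGHTEN = str.maketrans({
    "\u201c": '"',  # left (opening) double quotation mark
    "\u201d": '"',  # right (closing) double quotation mark
    "\u2018": "'",  # left (opening) single quotation mark
    "\u2019": "'",  # right (closing) single quotation mark, also used as an apostrophe
})


def straighten_quotes(text):
    """Return text with smart quotes replaced by regular quotation marks."""
    return text.translate(STRAIGHTEN)


print(straighten_quotes("\u201cFour score and seven years ago,\u201d said Lincoln"))
```

The result contains only characters from the basic ASCII set, so the quotation marks should map correctly no matter which character set the receiving application uses.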

Here’s a snippet of text which I manually typed into Windows Notepad:

  • "Four score and seven years ago," said Lincoln…

And here’s the identical snippet of text which I manually typed into Microsoft Word:

  • “Four score and seven years ago,��  said Lincoln…

Those are “opening” and “closing” double-quotation marks, and they look correct even while I’m entering this post, but after I post it, I see the closing quotation marks as two mystery characters.

You can change the settings in Microsoft Word to control how quotes are handled.

Apart from the “smart quotes” issue, you can still experience a variety of bizarre “weird character” problems, because most web sites, including TPT, use a character set called UTF-8 (http://en.wikipedia.org/wiki/UTF-8). But when I save a file from Microsoft Word as a web page, it uses a character set called “windows-1252.” And even if you “cut and paste” text to or from a “plain text” file, it may retain characters that won’t map properly (as shown above), and which will look completely normal until you’ve hit the “submit” button.
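The mismatch between UTF-8 and windows-1252 can be reproduced in a few lines of Python. This is a sketch of the general mechanism, not a description of what any particular web site does internally; misread is a hypothetical helper, and "cp1252" is Python's name for windows-1252.

```python
def misread(text, written_as="utf-8", read_as="cp1252"):
    """Store text using one character set, then read the bytes back using
    another, the way two mismatched programs would. Bytes that have no
    meaning in the second character set become U+FFFD, the "unknown
    character" symbol."""
    return text.encode(written_as).decode(read_as, errors="replace")


# An accented letter stored as UTF-8 but read as windows-1252 turns into
# two unrelated characters.
garbled_word = misread("caf\u00e9")

# An opening smart quote becomes three characters.
garbled_open = misread("\u201c")

# A closing smart quote becomes two characters plus an unknown-character
# symbol, because its third byte has no assignment in windows-1252.
garbled_close = misread("\u201d")

# The reverse mistake: windows-1252 bytes read as UTF-8.
garbled_reverse = misread("\u201d", written_as="cp1252", read_as="utf-8")

print(ascii(garbled_word), ascii(garbled_open), ascii(garbled_close), ascii(garbled_reverse))
```

In every case the text is intact until a second program interprets the bytes with a different character set, which is consistent with text that looks normal right up until it is submitted.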

On a related note, did you know that different web browsers display certain characters differently (or not at all)? If you view the same exact web page using Microsoft Internet Explorer, Mozilla Firefox, Google Chrome, Apple Safari, or Opera, you’ll see many differences in how the page appears, sometimes including characters that are properly displayed by some browsers but not others. (When I decided to use properly encoded “thin space” characters in a recent update to LessonIndex.com, I discovered that Opera doesn’t display them correctly: it shows a little box symbol instead of a blank space.)

There are many other variations between web browsers, which can result in problems if you don’t test a web page (or HTML document) by viewing it with all five of these commonly-used web browsers.  (When I launched LessonIndex.com, I didn’t realize that a minor coding error disabled most of the links on every page for users of Microsoft Internet Explorer and Google Chrome, although the pages worked fine with Firefox and Safari.  Firefox and Safari actually detected and corrected the coding error when displaying the page, but other browsers did not.)

Finally, you should be aware that even if they don’t affect the display of your text, “character variations” can also have an adverse impact on search. For example, some search systems recognize that including or excluding the accent for the word café doesn’t change its meaning in English, so a search for either variation will bring up all relevant results. Others do not, so someone searching for “cafe” without an accent won’t find documents that only use the word with an accent (and vice versa). Some search systems, designed with the English language in mind, simply ignore all accents (automatically substituting an unaccented character for every accented character).
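The “ignore all accents” approach can be sketched in Python with the standard unicodedata module. This is one common way to do it, assumed here for illustration; I don’t know how any specific search system implements it, and the function names fold_accents and matches are hypothetical.

```python
import unicodedata


def fold_accents(text):
    """Return text with accents removed from accented letters."""
    # NFKD normalization splits an accented letter into a base letter
    # followed by a separate "combining" accent mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping the combining marks leaves only the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches(query, document):
    """Accent-insensitive, case-insensitive substring search."""
    return fold_accents(query).casefold() in fold_accents(document).casefold()


print(fold_accents("caf\u00e9"))
print(matches("cafe", "Meet me at the Caf\u00e9"))
```

Folding both the query and the document means a search with or without the accent finds the same results. The trade-off is that words distinguished only by an accent in other languages are treated as the same word.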
