Converting Cyrillic Word-Processed Files to Text

If you work with Cyrillic (Russian, etc.) on a computer which uses a non-Cyrillic system font, and have ever tried to save a file created in a common word processor as text, you know the problem. You have a word-processed file which contains Cyrillic text. The Cyrillic text appears perfectly in the word processor, but if you save it as text and then try to open the text file, all you see is question marks. (If you use a computer which has a Cyrillic system font, then this should not be a problem.)
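
The question marks appear at the moment the text is converted to a character set that has no Cyrillic letters. The following Python lines are a minimal sketch of that effect, assuming the Western code page is Windows-1252 (this article does not say which one your system uses) and using a sample word chosen only for illustration:

```python
# Sample Cyrillic text (chosen for illustration only).
text = "язык"

# Assumption: the "Western" system code page is Windows-1252.
# It has no Cyrillic letters, so every letter is replaced by "?".
western = text.encode("cp1252", errors="replace")

# A Cyrillic code page (Windows-1251) represents the same text without loss.
cyrillic = text.encode("cp1251")

print(western)                            # the damaged bytes
print(cyrillic.decode("cp1251") == text)  # the round trip is lossless
```

Once the letters have been replaced in this way, the original text cannot be recovered from the saved file.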

Is there a way to get around this?

One way is to change your computer's system font to a Cyrillic system font. If you want to experiment with changing your system font, some operating systems may allow you to configure this yourself. If you are using a common desktop operating system and need a third-party solution, you might consider Parawin, which is available from such companies as Smartlink Corporation and VirtualWare Technologies. (Note that a boot manager such as System Commander will allow you to install multiple versions of the same operating system on your computer, so that you could have one version with a Western system font and another with a Cyrillic system font, if you are worried about the Cyrillic system font causing problems.)

However, most people who do not already have a Cyrillic system font would not want one. This is especially true if you use other foreign languages such as German or French, since the other languages' non-ASCII characters (é, ô, ß, ü, etc.) will not be rendered correctly by a Cyrillic system font. Moreover, most people think that this is too much work and expense to solve such a simple problem.

Another way is to use the freeware product Antiword, which is capable of extracting text from certain kinds of word-processed documents. This solution certainly works, but it is aimed mostly at people who do not have access to the word processor in question.

Fortunately, there is an easier way:

(1) Launch an HTML editor such as Mozilla or Netscape, and open a new Composer page.
(2) Select the character encoding you want, for example "Cyrillic (KOI8-R)"
    or "Cyrillic (Windows-1251)". To do this, access the menus as follows:
    View | Character Coding | Cyrillic (Windows-1251)
    The first time you do this, you'll have to look further under the menus:
    More | East European
(3) Switch to your word processor, select all the text you wish to save as text, and copy it.
(4) Switch to Composer, and paste the text into the new Composer page.
(5) Save the HTML file.

Now, open this HTML file using your word processor. To do this, use the "Open" command; don't double-click on the file. Voilà! You have your Cyrillic back (without formatting, of course).
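
If the text is already available to a script as Unicode (an assumption; the procedure above takes it from the clipboard instead), a file of the kind Composer saves can be approximated in Python. This is only a sketch and not what Composer does internally: `build_html`, its parameters, and the sample text are hypothetical.

```python
import html


def build_html(text: str, charset: str = "koi8-r", title: str = "untitled") -> bytes:
    """Return a minimal HTML page holding `text`, encoded in `charset`."""
    # Escape &, < and > so the text cannot be mistaken for markup.
    body = "<br>\n".join(html.escape(line) for line in text.splitlines())
    page = (
        '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n'
        "<html>\n<head>\n"
        '  <meta http-equiv="content-type" '
        f'content="text/html; charset={charset.upper()}">\n'
        f"  <title>{html.escape(title)}</title>\n"
        "</head>\n<body>\n"
        f"{body}\n"
        "</body>\n</html>\n"
    )
    # The default error mode is "strict": encoding raises an error if the
    # text contains a character that the chosen charset cannot represent.
    return page.encode(charset)


page_bytes = build_html("Русский язык", charset="koi8-r")
```

Writing `page_bytes` to a file in binary mode would give a file that declares its own encoding, which is what lets a program open it correctly without a Cyrillic system font.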

Next, close the HTML file in your word processor and open it using a text editor. If you have a non-Cyrillic system font, the text will look like garbage (e.g., "ÑÚÙË"), but it is actually Cyrillic text displayed using a non-Cyrillic font. Note: If you see codes like "&#1103;&#1079;&#1099;&#1082;", you either forgot to change the character encoding in Composer, or you did so AFTER you pasted in the text. You MUST set the character encoding BEFORE you paste in the text.
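
Both symptoms can be reproduced in a few lines of Python. This is a sketch under two assumptions: the file is in KOI8-R, and the non-Cyrillic font corresponds to Latin-1. The sample word is chosen for illustration.

```python
text = "язык"

# KOI8-R bytes read through a Western code page (assumed here to be
# Latin-1) show up as accented Latin letters.
garbage = text.encode("koi8-r").decode("latin-1")

# The bytes themselves are intact: undoing the misreading recovers the text.
recovered = garbage.encode("latin-1").decode("koi8-r")

# Without a Cyrillic encoding, the letters can only be written as
# numeric character references.
references = text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

print(garbage)     # ÑÚÙË
print(recovered)   # язык
print(references)  # &#1103;&#1079;&#1099;&#1082;
```

The first result is harmless, because only the display is wrong. The second means that the file holds references rather than Cyrillic bytes, so it is not yet a Cyrillic text file.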

This HTML file is a flat text file, although it also contains HTML tags such as the following:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
  <meta http-equiv="content-type" content="text/html; charset=KOI8-R">
  <title>russdict_text.koi8</title>
</head>
<body>
<br>

You can delete these tags now, if you don't want them. Note: Once you strip out the tags, your brain-dead word processor may no longer be able to display the text correctly, but the text is still correct.
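
Deleting the tags by hand is easy for a short file. For a longer one, the job could be scripted along the following lines. This is a sketch using Python's standard `html.parser` module; `TextExtractor`, `strip_tags`, and the sample page are hypothetical, and the code assumes that line breaks in the HTML source match the line breaks you want in the text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect body text, skipping everything inside <head> (e.g. <title>)."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.in_head = False

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False

    def handle_data(self, data):
        if not self.in_head:
            self.parts.append(data)


def strip_tags(page: bytes, charset: str = "koi8-r") -> bytes:
    """Return the text of `page` without tags, in the same encoding."""
    parser = TextExtractor()
    parser.feed(page.decode(charset))
    parser.close()
    return "".join(parser.parts).strip().encode(charset)


sample = (
    "<html>\n<head>\n<title>russdict_text.koi8</title>\n</head>\n"
    "<body>\nРусский<br>\nязык\n</body>\n</html>\n"
).encode("koi8-r")
plain = strip_tags(sample)
```

The result is still encoded in KOI8-R, but it no longer carries the `charset` declaration, which is why a program that relied on that declaration may stop displaying the text correctly.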

You now have a Cyrillic text file! Note that if you want to e-mail this file it must first be MIME-encoded or uuencoded, or it will be damaged if it goes through a 7-bit gateway. Most common mail clients do this automatically today.
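
To show what MIME encoding does to such a file, here is a minimal sketch using Python's standard `email` package. The subject line and sample text are hypothetical, and base64 is only one of the transfer encodings a mail client might choose.

```python
from email.message import EmailMessage

text = "Русский язык\n"

msg = EmailMessage()
msg["Subject"] = "Cyrillic text file"  # hypothetical subject line

# Encode the body as KOI8-R and wrap it in base64, so that only
# 7-bit characters travel over the wire.
msg.set_content(text, charset="koi8-r", cte="base64")

wire = msg.as_bytes()
print(wire.decode("ascii"))
```

Every byte of `wire` is below 128, so a 7-bit gateway has nothing to damage, and the recipient's mail client can restore the Cyrillic from the declared charset.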

Copyright © 2003-2010 Thomas Hedden
