Working with foreign texts: how to improve understanding and interest in learning a language?

In life or at work you sometimes have to deal with texts in a foreign language while your knowledge of it is still far from perfect. To read and understand what the text is about (and, in the best case, to learn a few new words), I usually use one of two options: translating the whole text in the browser, or translating each word separately with, for example, ABBYY Lingvo. Both methods have serious shortcomings. First, the browser translates whole sentences, which means the word order can change and the translation can end up even more confusing than the original. Second, the browser offers no alternative translations or synonyms, so learning new words becomes problematic. Alternatives and synonyms can be obtained by looking up each word in a translator, but that takes time, especially when there are many words. Finally, while reading a text I would like to know which words are the most frequent in it, so that I can memorize them and use them in my own writing and speech.

I thought it would be nice to have such a "translator" at hand, so I decided to implement it in Python. Everyone interested, welcome under the cut.

Word counting

When writing the program, I followed this logic. First, convert all the text to lowercase, remove unnecessary characters and symbols (.?!, etc.) and count how many times each word appears in the text. Inspired by code from Google, I did this without the slightest difficulty, but I decided to store the results in a slightly different form, namely {1: [words with frequency 1], 2: [words with frequency 2], ...}. This is useful if you want to sort within each group of words, for example so that the words keep the same order as in the text. The result is a double sort: the most frequent words come first, and words with the same frequency are ordered as they appear in the source text. This idea is reflected in the following code.
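The original code was published as images and is not reproduced here, so below is a sketch of the counting step under the logic just described. The function name word_count_dict matches the one used later in the article; for simplicity it takes the text directly rather than a filename, and the optional exclude parameter anticipates the exclusion list discussed below:

```python
import re
from collections import defaultdict

def word_count_dict(text, exclude=None):
    """Group the words of a text by frequency: {1: [words seen once], 2: ...}.

    Within each group, words keep the order of their first appearance in the
    text, which gives the "double sort" described above. `exclude` is an
    optional set of words to skip entirely.
    """
    exclude = exclude or set()
    # Lowercase and keep only letter runs, dropping punctuation and digits
    words = re.findall(r"[^\W\d_]+", text.lower())

    counts = {}       # word -> frequency
    first_seen = {}   # word -> position of first occurrence
    for pos, w in enumerate(words):
        if w in exclude:
            continue
        if w not in counts:
            counts[w] = 0
            first_seen[w] = pos
        counts[w] += 1

    groups = defaultdict(list)
    for w, c in counts.items():
        groups[c].append(w)
    for c in groups:
        groups[c].sort(key=first_seen.get)   # source-text order inside a group
    return dict(groups)
```

The regular expression matches Unicode letters by default in Python 3, so umlauts and other accented characters survive the cleanup.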


Excellent, everything works as I wanted, but I suspect that the top of the list will be dominated by auxiliary words (such as "the") and other words with an obvious translation (e.g. "you"). You can get rid of them by creating a special list of the most used words and excluding its members while building the vocabulary. Why is this convenient? Because, having learned a particular word, we can add it to the list and its translation will no longer be shown. Let's call this list dictList and forget about it for a while.
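As a sketch, dictList can simply be a set that is checked before a word enters the vocabulary (the words below are illustrative, not the article's actual list):

```python
# Illustrative exclusion list of already-known or too-common words
dictList = {"the", "a", "and", "you"}

def filter_known(words, known=dictList):
    """Drop words that are on the exclusion list before counting them."""
    return [w for w in words if w not in known]
```

A learned word then disappears from the output simply by being added to dictList.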

Translating words

After spending a few minutes searching for a suitable online translator, I decided to try out Google and Yandex. Since Google closed its Translate API exactly 3 years and 1 day ago, we will use the workaround proposed by WNeZRoS. In response to a request for a word, Google offers a translation, alternative translations and back translations (i.e. synonyms). Using Yandex normally requires obtaining an API key, and its response contains not only the translation but also usage examples, and probably something else. In both cases the answer comes in JSON format: rather simple for Google, a little more complicated for Yandex. For this reason, and also because Google knows more languages (and often more words), I settled on it.

Requests will be sent using the wonderful grab library, and answers will be stored in an auxiliary text file (dict.txt). We will try to find in them the main translation, the alternatives and the synonyms, and print whichever are present. The last two options can be turned off. The corresponding code looks as follows.
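The original listing was an image, so here is a sketch of this step. The endpoint URL and the JSON layout below are assumptions for illustration only (the real workaround may differ, and Google can change or block it at any time), and stdlib urllib stands in for grab so the sketch has no external dependencies:

```python
import json
import urllib.parse
import urllib.request

# Assumed unofficial endpoint; the actual URL from the WNeZRoS workaround
# may differ and is not guaranteed to keep working.
TRANSLATE_URL = "http://translate.google.com/translate_a/t?client=x&sl={sl}&tl={tl}&text={text}"

def fetch_translation(word, sl="de", tl="ru"):
    """Request a raw JSON answer for one word (the article used grab here)."""
    url = TRANSLATE_URL.format(sl=sl, tl=tl, text=urllib.parse.quote(word))
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))

def parse_answer(answer, alternatives=True, synonyms=True):
    """Pull the main translation and, optionally, the alternative translations
    and back-translations (synonyms) out of an assumed JSON layout."""
    result = {"translation": answer["sentences"][0]["trans"]}
    if alternatives and "dict" in answer:
        result["alternatives"] = [t["word"] for t in answer["dict"][0]["entry"]]
    if synonyms and "dict" in answer:
        entry = answer["dict"][0]["entry"][0]
        result["synonyms"] = entry.get("reverse_translation", [])
    return result
```

Keeping the parsing separate from the network call means parse_answer can be exercised on a saved answer (e.g. from dict.txt) without hitting the network.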

As you can see, the default is translation from German into Russian. The variable key corresponds to the frequency of the word in the text; it is passed in from another function, which calls the translation for each word.

Call of translation function

It's simple: I want to get the groups of words with their frequencies as a dictionary (the word_count_dict function) and translate each word (the translate function). I also want to keep only the first n groups of the most used words.
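A possible shape for this calling function, assuming the {frequency: [words]} dictionary from the counting step and any word-to-translation callable in place of the real translate:

```python
def print_top(groups, translate, n=3):
    """Translate and print the n most frequent groups of words.

    `groups` is a {frequency: [words]} dictionary; `translate` is any
    callable mapping a word to its translation (a stand-in for the real
    translation function described above).
    """
    top = []
    for key in sorted(groups, reverse=True)[:n]:   # highest frequency first
        for word in groups[key]:
            tr = translate(word)
            print(key, word, "->", tr)
            top.append((key, word, tr))
    return top
```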

List of most used words

Excellent, the program is almost ready; all that remains is to make the list of the most used words. Such lists are easy to find on the internet. I compiled lists of the 50, 100 and 500 most used words in German and wrote them to a separate file.
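Loading such a file might look like the sketch below (one word per line is an assumed format; the file name is illustrative):

```python
def load_common_words(lines):
    """Build the exclusion set from a word list, one word per line,
    skipping blank lines and normalizing to lowercase."""
    return {line.strip().lower() for line in lines if line.strip()}

# e.g. dictList = load_common_words(open("top50_de.txt", encoding="utf-8"))
```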

If someone makes a similar list for English or another language, I would be grateful if he or she shared it, so I can add it to mine.

Preliminary results

Running the program, results like the following can be obtained:

Well, the code is written and the program works, but how convenient and efficient is it? To answer this question, I took a couple of German texts for verification.

The first is an article from Deutsche Welle about Deutsche Bank financing coal mining in Australia. The article contains 498 words, of which the 15 most frequent in the text (using the list of the 50 most used German words for exclusions) account for 16.87% of the total. Roughly, this means that if a person does not know these words, then after reading the translations of 6.67% of the distinct words, their level of understanding grows by about 17% (if understanding is measured simply as the share of known words in the text). At first sight, that's pretty good.

The second article, from Spiegel, tells how the German stock index DAX reacted to Poroshenko's victory in the presidential elections in Ukraine (yes, it rose). This article contains 252 words, of which the 8 most frequent (6.06%) account for 11.9% of the text.

In addition, it should be noted that if the translated text is short enough that each word occurs only once (for example, an e-mail message), then it is quite convenient to follow the suggested translations in the same order as the words occur in the text.

Sounds nice (es klingt schön), but these are very rough tests, as I made too many assumptions. I think the program's real usefulness can only be found out through regular use, and that is, unfortunately, not very easy at the moment: to translate a text you first have to copy it into a .txt file, assign the file's name to the variable filename, and then run the print_top function.

What is missing?

Instead of a conclusion, I would like to consider what is missing at this stage and how it can be improved.

Firstly, as just mentioned, convenience. Using the code is inconvenient: you have to copy the text into a file, and you depend on Python and the grab library. What to do? One option is to write a browser extension that lets you select a specific element on the page (similar to how it is done in Reedy, for example) and get its translation. Secondly, lists of the most commonly used words in other languages are needed for the exclusions. Finally, there are various bugs with encodings.

Most likely I will not get around to these changes in the near future (now that the code is written, it's time to start actually understanding the language better!), so if someone wants to join in, I'll be glad of the company and the help.

You can find all the code below, as well as on GitHub.

Comments (1)

  • Сергей Кашуба

    Nice, but code in images