Data-driven language learning


A post that outlines my experiment in learning Bahasa Indonesia by identifying and memorizing the most frequently used words that make up 80% of a body of text.

Some research here discusses how word frequency relates to the vocabulary size required to learn a language. The page mentions that “By knowing the 2000 most frequent word families of English, readers can understand approximately 80% of the words in any text. Therefore, the goal of an English learner should be to acquire these 2000 word families first, since this relatively small number of words is recycled in any piece of writing and ensures the basis for reading comprehension.”

I figured I would do the same for Bahasa Indonesia. The strategy is to load the words covering the top 80% of occurrences, along with their English equivalents, into Anki and memorize them, allowing me to comprehend a significant part of the language relatively quickly.

My first attempt was to extract words from the Bahasa Indonesia Wikipedia corpus by downloading the “All pages, current versions only.” dump from here. The words were extracted from the document, lowercased with everything except letters and spaces removed, and ranked in order of frequency. This attempt failed due to the sheer number of tags, specialized symbols, and other metadata in the Wikipedia dump.
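The extract-lowercase-strip-rank pipeline described above can be sketched in a few lines of Python (a minimal illustration; a real Wikipedia dump would first need its markup stripped, which is exactly where this attempt fell down):

```python
import re
from collections import Counter

def word_frequencies(text):
    """Lowercase the text, keep only runs of letters, and count occurrences."""
    # Indonesian is written in the Latin alphabet, so [a-z]+ after lower()
    # captures words while dropping digits, punctuation, and markup symbols.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)

freq = word_frequencies("Dan yang dan itu, dan!")
print(freq.most_common(2))  # [('dan', 3), ('yang', 1)]
```

`Counter.most_common()` gives the frequency ranking directly, which is all the first attempt needed once the input was clean text.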

My second attempt used the corpus from a “500,000 Word Bahasa Indonesia Parallel Corpus with Penn Treebank” (many thanks to Prasetya Dwicahya for his help with verification), which worked better since it produced pure words. The issue with this method, however, is that individual words often make no sense on their own: some vocabulary items are actually word pairs or even sequences of three words. For example, one could memorize “thank” and “you” separately but miss “thank you”, which is an important part of the vocabulary.
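One way to surface multi-word units like this is to count n-grams alongside single words; this is a sketch of the idea rather than what the post's scripts actually did, and the tokens and choice of n=2 below are illustrative:

```python
from collections import Counter

def ngram_counts(tokens, n=2):
    """Count every run of n consecutive tokens so multi-word units surface."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# "terima kasih" ("thank you") only shows up as a unit in the bigram counts.
tokens = ["terima", "kasih", "banyak", "terima", "kasih"]
print(ngram_counts(tokens))
```

High-frequency bigrams that occur far more often than chance would suggest are good candidates for flashcards of their own.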

This issue was resolved by scraping kamus.net, an Indonesian <-> English dictionary. A table was created with the Indonesian word in one column and the equivalent English in another; this was matched against the corpus loaded from the Treebank, resulting in a third column containing the frequency.

For example:

yang which
dan and
adalah is
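The matching step amounts to a join between the scraped dictionary and the corpus counts. A minimal sketch, where the dictionary entries and frequency numbers are illustrative placeholders rather than real data from kamus.net or the Treebank:

```python
import csv
from collections import Counter

# Hypothetical scraped dictionary: Indonesian word -> English equivalent.
dictionary = {"yang": "which", "dan": "and", "adalah": "is"}

# Hypothetical frequencies from the Treebank corpus.
corpus_freq = Counter({"yang": 5420, "dan": 4980, "adalah": 1730, "xyzzy": 3})

# Join: keep dictionary entries that actually appear in the corpus,
# attaching the frequency as a third column.
rows = [(word, gloss, corpus_freq[word])
        for word, gloss in dictionary.items()
        if corpus_freq[word] > 0]
rows.sort(key=lambda r: r[2], reverse=True)

# Write the Indonesian/English/frequency table as a CSV for Anki import.
with open("anki_deck.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```

Words in the corpus but not in the dictionary (and vice versa) simply drop out of the join, which also filters the Treebank tokens down to real dictionary words.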

The results were saved as a CSV file and loaded into Anki, a spaced-repetition learning system that works like flashcards, with the Indonesian word on the front and the English on the back. Rather than showing words randomly, it shows words you have difficulty remembering more frequently and easy words less often.

I’ve been at it for about a week now and have so far picked up a vocabulary of about a hundred words. I hope somebody will find this post useful and be able to build on it. In case anyone is wondering, 1327 words form 80% of all the words in the corpus.
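For anyone wanting to reproduce that 1327-word figure on their own corpus, the cutoff falls out of walking down the frequency ranking until the cumulative count reaches 80% of all tokens. A sketch with toy numbers:

```python
from collections import Counter

def words_for_coverage(freq, target=0.80):
    """Return how many of the most frequent words cover `target` of all tokens."""
    total = sum(freq.values())
    running = 0
    for rank, (_, count) in enumerate(freq.most_common(), start=1):
        running += count
        if running / total >= target:
            return rank
    return len(freq)

freq = Counter({"a": 50, "b": 30, "c": 15, "d": 5})
print(words_for_coverage(freq))  # 2 ("a" and "b" cover 80 of 100 tokens)
```

Note this counts token coverage, not distinct-word coverage, which matches the 80:20 framing in the research quoted above.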

Update: someone has taken a slightly different approach for Chinese that also seems very effective; please have a look.


CC BY 4.0 This work is licensed under a Creative Commons Attribution 4.0 International License.

2 thoughts on “Data-driven language learning”

  1. Thank you, this was a very interesting read. You don’t happen to have the CSV you could share publicly, do you? How did your learning go?

    The 80:20 rule only goes so far. In particular, Indonesian words often have multiple meanings, e.g. ‘sayang’ meaning ‘love’ or ‘pity’, and word lists don’t account for phrasal constructions such as ‘sama sekali’ (entirely). Getting from 80% to 95% coverage means going from 1,350 words to 20,000+, but those words are used more often than you might think.

  2. Hi, could you email me at mishari [at] mishari [dot] net and I’ll send you the scripts I wrote. Due to copyright and the TOS, I don’t think I can send the CSV directly.

    Effectiveness-wise, not very: each word has too many different meanings in different contexts for this approach to be useful on its own, so I’m now using Duolingo plus Anki with pretty good results. I’m also going to try the new approach here.
