Word Frequency Analysis of “Pajama Drive” Setlist by JKT48
Introduction
Word frequency analysis describes how often each word occurs in a given text corpus. In language teaching, a word list ordered by frequency gives learners a rational basis for getting the best results from their vocabulary study (Nation & Waring, 1997). However, such frequency lists are intended directly for writers, not learners. Word frequency analysis is also a common method in computational linguistics for making corpus analysis easier.
An example of existing word frequency analysis comes from the Global Database of Events, Language and Tone (GDELT). It analyzed the top 100 names mentioned in English-language news articles about a specific world leader and how often these names occurred. Mapping the results as a word cloud shows the relationships between world leaders. In the case of Toomas Hendrik Ilves, president of Estonia, the US president, Barack Obama, is the name that appears most frequently in articles about Ilves. You can check the complete article here.
In this article, we will look at a word frequency analysis of the “Pajama Drive” setlist by JKT48. We will find out how many times certain words appear in this setlist, along with other interesting findings. The analysis is expected to help us better understand what the setlist is about. As a personal project and for academic purposes, I am collecting all JKT48 song lyrics as a text corpus (a large and structured set of texts).
Methodology
The process began by compiling all the lyrics in the “Pajama Drive” setlist into a TXT file. Word frequencies were counted using Python, a programming language designed to be highly readable. As my Python IDE (Integrated Development Environment), or simply Python editor, I used Komodo.
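The counting step can be sketched roughly as follows. This is a minimal illustration, not the original script: the tokenizing rule (lowercase, letters only, optionally joined by an apostrophe or hyphen), the function names, and the sample line are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumed rule: a word is a run of letters, optionally joined by an
# apostrophe or hyphen. The rule used by the original script is not known.
WORD_PATTERN = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")

def tokenize(text):
    """Lowercase the text and return its words in order of appearance."""
    return WORD_PATTERN.findall(text.lower())

def word_frequencies(text):
    """Return (total word count, unique word count, Counter of words)."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    return len(tokens), len(counts), counts

# Made-up sample line for illustration, not an actual lyric from the setlist.
sample = "Aku menunggu, aku menunggu kau di sini"
total_words, unique_words, counts = word_frequencies(sample)
print(total_words, unique_words)
print(counts.most_common(2))
```

To run this on a real lyrics file, the text could be read first with `open(path, encoding="utf-8").read()`, where `path` points to the compiled TXT file, and then passed to `word_frequencies`.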
Results
The output shows that the “Pajama Drive” setlist has 2914 words in total, 964 of which are unique. The count includes words repeated across the verses and chorus of a song.
To highlight the important and significant words, I removed stop words such as “dan” (and), “ke” (to), “dari” (from), “di”, “karena” (because), “kan”, “ketika” (when), etc. After removing the stop words, I eliminated the words that occurred fewer than 4 times in the whole setlist. As a result, we have the 103 most frequent words in the “Pajama Drive” setlist, each occurring at least 4 times. The highest count recorded for a single word is 59.
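The two filtering steps above can be sketched like this. It is only an illustration under stated assumptions: the stop word set contains just the examples quoted in the text (the full list is not given), the function name is hypothetical, and the count for “di” is invented so that the stop word filter has something to remove.

```python
from collections import Counter

# Only the stop words quoted in the text; the complete list is not given.
STOP_WORDS = {"dan", "ke", "dari", "di", "karena", "kan", "ketika"}

def frequent_content_words(counts, stop_words=STOP_WORDS, min_count=4):
    """Drop stop words and rare words, then sort by descending count."""
    kept = {
        word: n
        for word, n in counts.items()
        if word not in stop_words and n >= min_count
    }
    # Ties are broken alphabetically so the order is deterministic.
    return sorted(kept.items(), key=lambda item: (-item[1], item[0]))

# Counts for ku, aku, cinta and ya come from the article;
# the count for "di" is invented for the example.
counts = Counter({"ku": 59, "aku": 48, "di": 30, "cinta": 30, "ya": 2})
print(frequent_content_words(counts))
```

Here “di” is removed as a stop word and “ya” is removed because it occurs fewer than 4 times, which leaves ku, aku and cinta.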
Those 103 most frequent words are categorized into pronouns (kata ganti), nouns (kata benda), adjectives (kata sifat), verbs (kata kerja), adverbs, particles, and words from foreign languages (English and Japanese). The verb category is divided into five subcategories: root words (kata dasar), active verbs (kata kerja aktif) using the prefix ‘me-’ or ‘me-kan’, passive verbs (kata kerja pasif) using the prefix ‘di-’, intransitive verbs (verbs that do not need an object), and command words.
You can check the data (it is very ugly and gross data, I warn you) here.
Discussion
If you do not feel like reading the long and boring text that follows, you can go straight to the infographic below, which displays the word frequency results.
This word frequency analysis produced some interesting results. The word aku (I) appeared 48 times and ku (the short form of aku) appeared 59 times, which makes them the second and first most frequent words, respectively. The word diriku (me) appeared 20 times and dirimu (you) appeared 17 times. To make it more interesting, although it is not among the 103 most frequent words, dirinya (he/him or she/her) appeared only once. As for the other Indonesian words for ‘you’, kau is sung 23 times and kamu is sung 17 times.
Cinta (love), the most popular word in pop songs, appeared 30 times in the “Pajama Drive” setlist. Mimpi and impian (dream, yume), words that are ubiquitous in 48 Family songs, appeared 9 times and 6 times, respectively. The word harapan (hope) appeared 5 times.
In the adjective category, it is interesting that kuat (strong) and lemah (weak) both appeared 5 times. Meanwhile, the most frequent adjective in the “Pajama Drive” setlist is putih (white) with 10 occurrences, followed by suci (pure) with 8.
From the verb category, we can infer that “Pajama Drive” may be about waiting and sadness. The word menunggu (waiting) appeared 11 times, menangis (crying) 5 times, hilang (lost, disappear) 8 times, and pergi (go) 7 times. Disapa (to be greeted by) and dicium (to be kissed by) each appeared 4 times. The word maafkanlah (please forgive) appeared 5 times. This setlist said tidak (no) 28 times and ya (yes) only 2 times.
Three words from English and Japanese are listed among the 103 most frequent words: jump appeared 25 times, wasshoi 12 times, and yoroshiku 4 times. Looking beyond the 103 most frequent words, the “Pajama Drive” setlist has 51 foreign words in total: 40 English words and 11 Japanese words.
Infographic link:
Further Analysis
The word frequency analysis of JKT48 songs is not finished yet. I will continue by analyzing other setlists and JKT48 singles. Further analysis might take a broader look at all setlists in one experiment to measure aspects such as consistency among setlists, or word rank versus word frequency using Zipf’s law and power law probability distributions.
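As a rough idea of what the rank versus frequency comparison could look like: Zipf’s law predicts that a word’s frequency is roughly proportional to 1/rank, so a plot of log(frequency) against log(rank) should be close to a straight line with slope near -1. The sketch below estimates that slope with an ordinary least-squares fit. The function names are hypothetical and the counts are synthetic, chosen to follow Zipf’s law exactly; they are not results from the setlist.

```python
import math

def rank_frequency(counts):
    """Return (rank, frequency) pairs, where rank 1 is the most frequent word."""
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))

def zipf_slope(counts):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two words; a slope near -1 is what Zipf's law predicts.
    """
    pairs = rank_frequency(counts)
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    n = len(pairs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic counts with frequency = 120 / rank, not data from the setlist.
ideal = {"w1": 120, "w2": 60, "w3": 40, "w4": 30}
print(rank_frequency(ideal))
print(round(zipf_slope(ideal), 3))
```

A simple log-log fit like this is only a first check; a proper test of a power law distribution would need more careful statistics.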
Reference
Nation, P., & Waring, R. (1997). Vocabulary size, text coverage and word lists. Vocabulary: Description, acquisition and pedagogy, 14, 6-19.
I somewhat expected that cinta would be at or near the top of the count for nouns, but putih and suci in adjectives are somewhat intriguing. It’s evident then that there is some thematic coherence in the lyrics as one could equate those two adjectives together.
Is the instance of arc in the count of foreign words from the English word, or is it from Joan of Arc / Jeanne d’Arc? If the latter, perhaps it should be excluded as it is not an actual word per se.
Thank you for the comment, Mas Richard.
Actually, I used Joan of Arc (the English spelling) instead of the French one, Jeanne d’Arc. You are correct. If I used the French spelling, the program would read “d’Arc” as one word.
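To illustrate the point about the apostrophe, here is a small sketch of how two different tokenizing rules treat the two spellings. Both rules and both function names are assumptions made for the example; they are not necessarily what the original program does.

```python
import re

def tokenize_keep_apostrophes(text):
    """Treat letters joined by an apostrophe as a single word."""
    return re.findall(r"[^\W\d_]+(?:['’][^\W\d_]+)*", text)

def tokenize_letters_only(text):
    """Treat every run of letters as a word, splitting at apostrophes."""
    return re.findall(r"[^\W\d_]+", text)

print(tokenize_keep_apostrophes("Jeanne d’Arc"))  # ['Jeanne', 'd’Arc']
print(tokenize_letters_only("Jeanne d’Arc"))      # ['Jeanne', 'd', 'Arc']
print(tokenize_keep_apostrophes("Joan of Arc"))   # ['Joan', 'of', 'Arc']
```

With the English spelling, both rules give “Arc” as a separate word, which is why it shows up in the foreign word count.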