# PCA and K-means Clustering Indonesia College Entrance Examination — UTBK 2019

UTBK is an annual college entrance exam held nationwide by state universities in Indonesia. Eligible exam takers are high school students who graduated within the last three years. After the exam, takers can use their scores to apply to universities, choosing at most two majors, each at its respective university.

There are two types of high school students: science majors and humanities majors. UTBK consists of two main parts, as follows:

The first part of UTBK is the Scholastic Aptitude Test, covering:

1. KPU (Kemampuan Penalaran Umum) — General Reasoning

2. KUA (Kemampuan Kuantitatif) — Quantitative Skills

3. PPU (Pengetahuan & Pemahaman Umum) — General Knowledge

4. KMB (Kemampuan Bacaan & Menulis) — Reading & Writing Comprehension

The second part of UTBK differs according to the students’ major.

Science major subjects:

1. Mathematics for Science (mat)

2. Physics (fis)

3. Chemistry (kim)

4. Biology (bio)

Humanities major subjects:

1. Mathematics for Humanities (mat)

2. Geography (geo)

3. History (sej)

4. Sociology (sos)

5. Economy (eko)

The data come from the Indonesia College Entrance Examination — UTBK 2019 dataset, which contains each exam taker’s score for every subject, their chosen majors and the corresponding universities, and whether each choice is the first or the second.

## PCA and Clustering using K-means

We will use an unsupervised learning technique, Principal Component Analysis (PCA), to reduce redundancy among the subject scores and to see which variables tend to move together. We will then overlay K-means clusters on the plot to partition the universities and understand what characterizes each group.
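As a rough illustration of the PCA step, the sketch below standardizes a score table and decomposes it with an SVD. It uses NumPy only and runs on synthetic stand-in data, not the UTBK dataset. The function names `standardize` and `pca` and the 40-by-4 table are assumptions made for this example, and the original analysis may well have used a different tool.

```python
import numpy as np

def standardize(X):
    """Center each column to mean 0 and scale it to unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def pca(X):
    """PCA via SVD of the standardized data.

    Returns (scores, loadings, explained_ratio).
    """
    Z = standardize(X)
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U * S                       # row coordinates on each component
    explained = S**2 / np.sum(S**2)      # share of total variance per component
    return scores, Vt.T, explained

# Synthetic stand-in (NOT the UTBK data): 40 "universities" x 4 "subjects",
# all driven by one shared factor plus noise.
rng = np.random.default_rng(0)
ability = rng.normal(size=(40, 1))
X = 500 + 80 * ability + 20 * rng.normal(size=(40, 4))

scores, loadings, explained = pca(X)
print("explained variance ratio:", np.round(explained, 3))
print("first two components together:", round(float(explained[:2].sum()), 3))
```

Because every synthetic column shares the same underlying factor, the first component absorbs most of the variance, which is the kind of redundancy PCA is meant to expose.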

We will explore and visualize the dataset with two questions in mind:

- How are the subject scores correlated with one another?
- Which universities are similar based on applicants’ scores?

Based on the silhouette width and the elbow method for clustering universities, it is best to partition Science into 5 clusters and Humanities into 4 clusters.
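The article does not show how the silhouette width and the elbow curve were computed, so the following is only a minimal sketch of the idea. It runs a plain Lloyd's k-means for several values of k on synthetic blobs (not the UTBK data) and reports the within-cluster sum of squares (for the elbow) and the mean silhouette width. The helpers `assign`, `kmeans`, and `mean_silhouette` are hypothetical names written for this example.

```python
import numpy as np

def assign(X, centers):
    """Index of the nearest center for every row of X."""
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dist.argmin(axis=1)

def kmeans(X, k, n_init=20, n_iter=100, seed=0):
    """Lloyd's k-means with random restarts; returns (labels, centers, wss)."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_init):
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            labels = assign(X, centers)
            updated = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(updated, centers):
                break
            centers = updated
        labels = assign(X, centers)
        wss = float(((X - centers[labels]) ** 2).sum())
        if best is None or wss < best[2]:
            best = (labels, centers, wss)   # keep the restart with the lowest WSS
    return best

def mean_silhouette(X, labels):
    """Average silhouette width over all points (requires at least 2 clusters)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        if own.sum() == 1:
            continue                        # singleton: silhouette 0 by convention
        a = D[i, own].sum() / (own.sum() - 1)
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return float(s.mean())

# Synthetic stand-in: three well-separated blobs in two dimensions.
rng = np.random.default_rng(42)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(size=(20, 2)) for c in blob_centers])

results = {}
for k in range(2, 7):
    labels, _, wss = kmeans(X, k)
    results[k] = (wss, mean_silhouette(X, labels))
    print(f"k={k}  WSS={wss:8.1f}  silhouette={results[k][1]:.3f}")

best_k = max(results, key=lambda k: results[k][1])
print("k with the highest mean silhouette:", best_k)
```

The elbow is read off the WSS column as the point where further increases in k stop paying off, while the silhouette column gives a single number to maximize. Using both, as the article does, guards against relying on one heuristic alone.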

Now, let’s see what pops out!

## Interpreting The Visualization

The visualization above, also known as a biplot, can be interpreted as follows:

- The percentage of the original variance explained by each component (dimension) is given in parentheses in the axes labels.
- Positively correlated variables have similar vectors and are grouped together.
- The vectors of negatively correlated variables are on opposite sides of the plot origin (opposite quadrants).

The first two principal components capture 95.3% of the variance for Science and 91.2% for Humanities, so most of the information we are interested in is already summarized in these two components, and we can be confident about deriving insights from them.

We can see from the vectors that for Science, the general knowledge score (score_ppu) and the reading & writing comprehension score (score_kmb) are highly correlated, as are the physics score (score_fis) and the chemistry score (score_kim).

As for Humanities, the general reasoning score (score_kpu) and the history score (score_sej) are highly correlated.

For both Science and Humanities, all of the scores are positively correlated with one another; there is no negative correlation. Though still positively correlated, the mathematics score has the weakest correlation with the other subjects, as its vector is the farthest from the other variables.
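To make the link between vector direction and correlation concrete, here is a small sketch on synthetic data (three made-up columns, not the UTBK subjects). It compares the correlation matrix with the angles between the variable vectors on the first two components. The column construction and the helper `angle_deg` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
shared = rng.normal(size=n)

# Hypothetical columns: a and b track the shared factor closely, c only loosely.
a = shared + 0.2 * rng.normal(size=n)
b = shared + 0.2 * rng.normal(size=n)
c = 0.5 * shared + 1.0 * rng.normal(size=n)
X = np.column_stack([a, b, c])

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
corr = np.corrcoef(Z, rowvar=False)

# Variable coordinates on the first two components: loadings scaled by the
# singular values, i.e. the correlation of each variable with each component.
_, S, Vt = np.linalg.svd(Z, full_matrices=False)
coords = (Vt.T * S / np.sqrt(n - 1))[:, :2]

def angle_deg(u, v):
    """Angle between two vectors, in degrees."""
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

angle_ab = angle_deg(coords[0], coords[1])
angle_ac = angle_deg(coords[0], coords[2])

print(f"corr(a, b) = {corr[0, 1]:.2f}, angle = {angle_ab:.1f} degrees")
print(f"corr(a, c) = {corr[0, 2]:.2f}, angle = {angle_ac:.1f} degrees")
```

The strongly correlated pair ends up with nearly overlapping vectors, while the weakly correlated column sits at a wider angle. The reading is only approximate when the first two components leave much of the variance unexplained.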

We also draw a scatter plot of universities, colored by K-means cluster and grouped by ellipses. A few things to note from the scatter plot:

For Science:

- Clusters 2, 3, and 4 all have small absolute values in Dimension 2.
- Cluster 1 has the greatest negative values in Dimension 1.
- Cluster 5 has the greatest positive values in Dimension 1.

For Humanities:

- Clusters 1, 3, and 4 all have small absolute values in Dimension 2.
- Cluster 4 has the greatest negative values in Dimension 1.
- Cluster 2 has the greatest positive values in Dimension 1.

Universities in cluster 1 for Science and cluster 4 for Humanities have large negative values in Dimension 1. The biplot shows that all the subject score vectors also point toward negative values in Dimension 1, so these universities have high scores across all subjects.
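The reasoning here can be checked with a short sketch: when every loading on the first component is negative, rows with high values on all variables land at the negative end of that component. The data below are synthetic (30 rows, 5 columns), and the sign flip is an assumption added only to reproduce the orientation described in the text, since the sign of a principal component is arbitrary.

```python
import numpy as np

# Synthetic stand-in (NOT the UTBK data): 30 rows, 5 positively correlated columns.
rng = np.random.default_rng(2)
ability = rng.normal(size=(30, 1))
X = 500 + 80 * ability + 20 * rng.normal(size=(30, 5))

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
pc1_loadings = Vt[0]

# The sign of a component is arbitrary; flip it so that all loadings are
# negative, matching the orientation described in the article.
if pc1_loadings.sum() > 0:
    pc1_loadings = -pc1_loadings

pc1_scores = Z @ pc1_loadings      # coordinate of each row on Dimension 1
avg_z = Z.mean(axis=1)             # average standardized score of each row

r = float(np.corrcoef(pc1_scores, avg_z)[0, 1])
top = int(np.argmin(pc1_scores))   # row with the most negative Dimension 1 value

print("PC1 loadings:", np.round(pc1_loadings, 2))
print(f"correlation between PC1 score and average z-score: {r:.3f}")
print(f"most negative row has average z-score {avg_z[top]:+.2f}")
```

With the opposite sign convention the same universities would sit at the positive end, so the direction of the loading vectors, not the sign itself, is what carries the interpretation.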

Moving on, let’s see what characterizes universities in each cluster according to their applicants' scores!

We can note that, for both Science and Humanities, universities that receive applications with high scores on a given subject (standardized values well above zero) generally receive high scores on the other subjects as well. The same holds for universities in the average-score cluster (standardized values near zero) and the low-score cluster (standardized values below zero).

There are some exceptions, though. In the Humanities plot, the cluster 1 line labeled *Universitas Negeri Jakarta* has a history score (score_sej) at the top end and a sociology score (score_sos) at the bottom end.
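A cluster profile of this kind can be sketched as the mean standardized score per cluster and subject. The example below assumes the plot was built from z-scored values, which the article does not state explicitly. It uses synthetic data with three made-up score tiers and hypothetical cluster labels rather than the UTBK clusters.

```python
import numpy as np

subjects = ["kpu", "kua", "ppu", "kmb"]   # labels borrowed from the exam sections

# Synthetic stand-in: 30 "universities" in three tiers (low, average, high).
rng = np.random.default_rng(3)
labels = np.repeat([0, 1, 2], 10)          # hypothetical cluster assignment
tier_mean = np.array([450.0, 550.0, 650.0])
X = tier_mean[labels][:, None] + 15 * rng.normal(size=(30, 4))

# Standardize each subject, then average within each cluster.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
profile = np.vstack([Z[labels == c].mean(axis=0) for c in np.unique(labels)])

for c, row in enumerate(profile):
    cells = ", ".join(f"{s}={v:+.2f}" for s, v in zip(subjects, row))
    print(f"cluster {c}: {cells}")
```

A row that is consistently positive, near zero, or negative across subjects corresponds to the high, average, and low groups described above. A row whose values swing between subjects would flag an exception worth inspecting.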

For Science, Cluster 1, with relatively high scores across all subjects, consists of *Institut Teknologi Bandung*, *Universitas Gadjah Mada*, *Universitas Indonesia*, and *Institut Teknologi Sepuluh Nopember*. As for Humanities, *Universitas Negeri Jakarta* might be an outlier and could therefore distort the result, so it needs further checking before moving on.
