Data Mining

Due to the increasing use of technology-enhanced educational assessment, data mining methods have been explored to analyze process data in assessment log files. However, most studies were limited to one data mining technique in a specific setting.

This study demonstrates the use of four frequently used supervised techniques, namely classification and regression trees (CART), gradient boosting, random forest, and support vector machine (SVM), and two unsupervised methods, the self-organizing map (SOM) and k-means, fitted to one assessment data set. The 2012 US sample from the Programme for International Student Assessment (PISA) (N = 426) responding to the problem-solving items is used to demonstrate the methods.


With the advancement of technology incorporated into educational assessment, researchers have been intrigued by a new type of data, process data generated from computer-based assessment, and by new data sources, such as keystroke or eye-tracking data. Most of the time, such data, often referred to as an “ocean of data,” have a very large volume and few ready-to-use features. How to explore, discover, and extract useful information from such an ocean has been a challenge.

What analyses should be performed on such process data? Although specific analysis methods should be used for different data sources with specific characteristics, some common analysis methods can be performed based on the generic characteristics of log files. Hao et al. (2016) summarized several common analytic actions when introducing the Python package glassPy.


The US sample (N = 429) was taken from the PISA 2012 public data set. Students were aged between 15 years 3 months and 16 years 2 months, representing 15-year-olds in the United States (Organisation for Economic Co-operation and Development, 2014). Three students with missing student IDs and school IDs were removed, resulting in a sample of 426 students.

There were no missing responses. The data set was randomly divided into a training data set (n = 320, 75.12%) and a test data set (n = 106, 24.88%). The training data set is usually around 2-3 times the size of the test data set to increase the accuracy of prediction (e.g., Sinharay, 2016; Fossey, 2017).
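The random split described above can be sketched as follows. This is only a minimal illustration, assuming placeholder student IDs (the integers 0 to 425) and an arbitrary seed; the source does not state how the split was actually implemented.

```python
import random


def split_train_test(ids, n_train, seed=0):
    """Randomly partition ids into a training list and a test list."""
    rng = random.Random(seed)  # arbitrary seed (an assumption), for reproducibility
    shuffled = list(ids)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder IDs standing in for the 426 students; not the real StIDStd values.
student_ids = range(426)
train_ids, test_ids = split_train_test(student_ids, n_train=320)

train_pct = round(100 * len(train_ids) / 426, 2)
test_pct = round(100 * len(test_ids) / 426, 2)
print(len(train_ids), train_pct)  # 320 75.12
print(len(test_ids), test_pct)    # 106 24.88
```

Fixing the seed keeps the partition reproducible, so any model comparison is made on the same training and test students.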


PISA 2012 included 42 problem-solving questions in 16 units. These items assess the cognitive processes involved in solving real-life problems in computer-simulated scenarios (Organisation for Economic Co-operation and Development, 2014). In this study, the problem-solving item TICKETS task2 (CP038Q01) was analyzed. This is a level 5 question (of six levels in total) that requires a high level of exploration and understanding to solve (Organisation for Economic Co-operation and Development, 2014). This interactive question asks students to explore and collect the information necessary to make a decision. The main cognitive processes involved in this task are planning and executing. Given the problem-solving scenario, students should develop a plan, test it, and modify it if necessary.

Description of Data

The log file data for the problem-solving item were downloaded from the PISA 2012 public data release. The data set consists of 4,722 actions from 426 students as rows and 11 variables as columns. The 11 variables (see Figure 2) include: cnt, which indicates the country (the United States in this study); school and StIDStd, which indicate the unique identifiers of the school and the student, respectively; event_number (ranging from 1 to 47), which indicates the cumulative number of actions taken by the student; event_value (see the raw event_values in Table 1), which indicates the specific action the student took at a given time; time, which indicates the exact timestamp (in seconds) corresponding to the event_value; and event, which reports the nature of the action (start item, end item, or ongoing action).
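To make the long-format layout concrete, the sketch below collapses action-level rows into one summary per student. It is a hypothetical illustration, not the authors' procedure: the rows are invented, the event labels and event_value strings are placeholders, and the two summary measures (number of actions and time on item) are assumed examples of what can be derived from such a log. Only the column names come from the description above.

```python
from collections import defaultdict

# Invented toy rows; labels and values are placeholders, not real PISA records.
rows = [
    {"cnt": "USA", "school": "001", "StIDStd": "0001", "event": "START_ITEM",
     "time": 0.0, "event_number": 1, "event_value": "NULL"},
    {"cnt": "USA", "school": "001", "StIDStd": "0001", "event": "ACTION",
     "time": 12.5, "event_number": 2, "event_value": "placeholder_action"},
    {"cnt": "USA", "school": "001", "StIDStd": "0001", "event": "END_ITEM",
     "time": 30.0, "event_number": 3, "event_value": "NULL"},
    {"cnt": "USA", "school": "002", "StIDStd": "0002", "event": "START_ITEM",
     "time": 0.0, "event_number": 1, "event_value": "NULL"},
    {"cnt": "USA", "school": "002", "StIDStd": "0002", "event": "END_ITEM",
     "time": 8.0, "event_number": 2, "event_value": "NULL"},
]


def summarize_by_student(rows):
    """Group log rows by (school, StIDStd) and compute simple per-student summaries."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["school"], row["StIDStd"])].append(row)

    summary = {}
    for key, events in grouped.items():
        # Order each student's actions by their cumulative action counter.
        ordered = sorted(events, key=lambda r: r["event_number"])
        summary[key] = {
            "n_actions": len(ordered),
            "time_on_item": ordered[-1]["time"] - ordered[0]["time"],
        }
    return summary


summary = summarize_by_student(rows)
for key, stats in summary.items():
    print(key, stats)
```

Grouping on both school and StIDStd is a cautious choice here, since the description presents them as separate identifiers and does not say whether student IDs are unique across schools.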