Development of Test Instrument to Assess Students' Understanding on Macroscopic, Sub-Microscopic, and Symbolic Levels of Acid-Base Topics Using Rasch Model

Students' conceptual understanding of acid-base topics has not been assessed comprehensively across multiple representations. A test instrument is needed to determine students' levels of understanding of acid-base topics comprehensively. This research developed an instrument to assess students' understanding of acid-base concepts at the macroscopic, sub-microscopic, and symbolic levels that can be claimed to be valid and reliable, with a good difficulty index and good item discrimination. This is Research and Development (R&D) research using the Rasch model. The research procedure adopted the ten stages of Wei et al. (2012). The research instrument is a questionnaire sheet given to chemistry lecturers and a chemistry teacher. The data were analyzed using the Rasch model with the minifacet and ministep software. The exact agreement value is 89.2%, which is close to the expected agreement of 89.5%. This means that there is a fit between the experts' agreement and the model estimate. The pilot test results showed that all items met the validity criteria based on MNSQ, ZSTD, and Pt-Mean Corr. The reliability value is 0.80, in the good category. The difficulty index shows a variation of easy, medium, and difficult items. The item discrimination is able to distinguish three levels of student ability.


Introduction
One of the primary subjects in senior high school is chemistry. A multirepresentational approach that incorporates the macroscopic, sub-microscopic, and symbolic levels, which are linked to each other in chemistry learning, might help students understand the numerous abstract concepts that make up chemistry (Zahro' & Ismono, 2021). Of the three levels of representation, only the macroscopic level can be directly observed by the senses. Concepts at the particulate level, such as molecules, atoms, or ions, are explained at the sub-microscopic level. The symbolic level, on the other hand, describes chemical phenomena in the form of symbols such as reaction equations and diagrams. If students can comprehend and connect the three levels, their understanding will be thorough and in-depth (Jariati & Yenti, 2020). In other words, all chemical concepts and theories are built on three levels of representation that cannot be separated from each other if a comprehensive understanding is to be achieved. This characteristic of chemistry is what leads the majority of students to think that chemistry is a difficult subject. One chemistry topic with abstract and fundamental concepts is acids and bases.
Preliminary research at four schools in Padang city indicates that students have trouble understanding acid-base material. Evidence of this is that the majority of students fail to reach the minimum completion criterion (KKM) on the daily acid-base test. It can be assumed that students have not been able to fully describe chemical concepts at the three levels of representation and that their understanding is stronger at the symbolic level than at the macroscopic and sub-microscopic levels. Meanwhile, most teachers have taught the concept of acids and bases through a multiple representation approach.
The learning media used, such as PowerPoint slides, student worksheets, and textbooks, have also been equipped with pictures and explanations at the macroscopic, sub-microscopic, and symbolic levels. However, the test instruments used by teachers have not been able to assess students' understanding comprehensively at all three levels of representation. The test instruments focus only on symbolic understanding, so the assessment carried out by the teacher is not aligned with what has been taught to students. Through assessment, information can be obtained about how successful students are in learning and teachers are in teaching, so that it serves as feedback for both. In this sense, assessment can be viewed as quality control of the learning activities carried out (Sudarmin, 2015).
Because no test instrument was available to assess students' understanding at all three levels of representation, a test instrument was developed that could assess students' understanding of acid-base material comprehensively. To produce a good-quality test instrument in terms of validity, reliability, difficulty index, and item discrimination, a Rasch model analysis is needed, as it provides information about the characteristics of the items in the developed instrument. The Rasch model was chosen because it has the advantage of accommodating measurement objectives more accurately (Bohori & Liliawati, 2019). This study aims to produce an instrument to assess students' understanding of acid-base concepts at the macroscopic, sub-microscopic, and symbolic levels using the Rasch model that can be claimed to be valid and reliable, with a good difficulty index and good item discrimination.

Methodology
This is research and development (R&D) using the Rasch model (Wei et al., 2012). The study adopted the ten stages of instrument development by Wei et al. (2012): (1) define the construct; (2) identify the defined construct; (3) define the outcome space of the items; (4) conduct pilot testing; (5) analyze the data with the Rasch model; (6) review item fit; (7) review the Wright map; (8) repeat steps 4-7 until the items fit the model; (9) establish claims about the quality of the items; and (10) develop documentation.
The subjects of this study consisted of five chemistry lecturers from FMIPA UNP, one chemistry teacher from high school 3 Padang, and 30 12th-grade students of high school 3 Padang who had studied acid-base material. The object of this study is the quality of the instrument produced in terms of validity, reliability, difficulty index, and item discrimination. The data are primary data obtained directly through content validity tests by validators/experts and through a trial of the questions with students. The research instrument is a content validity questionnaire on a Guttman scale with a choice of "Yes" or "No". Data analysis was carried out using the Rasch model with the help of the minifacet and ministep programs.
The quality of the items is reviewed in terms of four criteria: validity, reliability, difficulty index, and item discrimination. The validity of an item is reviewed from the Outfit Mean Square (MNSQ) value, with an accepted range from 0.5 to 1.5; the Outfit Z-Standard (ZSTD) value, accepted from -2.0 to +2.0; and the Point Measure Correlation (Pt-Mean Corr) value, accepted from 0.4 to 0.85. Item reliability is judged from the item reliability value, with the good category at ≥ 0.8. The difficulty index gives information about the standard deviation and the logit value of each item, which can be seen from the distribution of item difficulty on the Wright map and, more specifically, in the item measure output. The item discrimination can be determined from the separation value. The separation (H value) can be determined by the following equation.

$$H = \frac{(4 \times \text{separation}) + 1}{3}$$
The larger the separation value, the better the discrimination of items and respondents, because the instrument can then distinguish between groups of items and groups of respondents.
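The conversion from separation to the H value, written out as H = (4 × separation + 1) / 3 (the form used later in the item discrimination calculation), can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `strata` and `strata_groups` are hypothetical, and rounding H to the nearest whole number of groups is an assumption that mirrors how the result is interpreted in this study.

```python
import math


def strata(separation: float) -> float:
    """H value: (4 * separation + 1) / 3."""
    if separation < 0:
        raise ValueError("separation must be non-negative")
    return (4.0 * separation + 1.0) / 3.0


def strata_groups(separation: float) -> int:
    """Round H to the nearest whole number of distinguishable groups.

    Rounding half up is an assumption made for this sketch.
    """
    return math.floor(strata(separation) + 0.5)


# Illustrative separation values (not taken from the study output).
for g in (1.0, 2.0, 3.0):
    print(f"separation={g:.2f} -> H={strata(g):.2f} -> {strata_groups(g)} group(s)")
```

A separation close to 2, for example, yields an H value close to 3, that is, three distinguishable groups.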

Define Construct
The first stage is carried out by building a learning progression, which includes an analysis of the basic competency (KD) from the 2013 Curriculum, derived into several competency achievement indicators (IPK) on acid-base material. The basic competency is KD 3.10: explain the concept of acids and bases, their strength, and their ionization equilibrium in solution. This is followed by an analysis of the multiple representation aspects to identify the macroscopic, sub-microscopic, and symbolic representations of the acid-base concepts to be assessed. Table 1 shows the analysis of the learning progression.

C2
Sub-microscopic: the ionization process of acids in water to produce hydronium ions (H3O+); the ionization process of bases in water to produce hydroxide ions (OH-).
Symbolic: the combination of acid and base ionization reactions in water.

3.10.2 Explain the concept of acid-base according to Brønsted-Lowry. (Main IPK) C2
Symbolic: the combination of acid and base reactions involving the transfer of protons.

3.10.3 Explain the concept of acid-base according to Lewis. (Main IPK) C2
Symbolic: the combination of reactions involving the transfer of electron pairs in the formation of coordinate covalent bonds.

3.10.4 Explain the effect of acid or base ionization on water equilibrium. (Main IPK) C2
Macroscopic: pH value; changes in the pH of water after adding acids and bases.
Sub-microscopic: the effect of the addition of an acid or base on water equilibrium; the difference in the number of H+ ions and OH- ions in acid and base solutions.
Symbolic: equations of the water equilibrium reaction before and after the addition of acidic and basic substances; comparison of H+ and OH- concentrations before and after the addition of acids to water; comparison of H+ and OH- concentrations before and after the addition of bases to water.

3.10.5 Explain the strength of acids and bases. (Main IPK) C2
Macroscopic: measuring the strength of acids and bases using an electrolyte tester (e.g., strong acids and bases light the lamp brightly and produce many gas bubbles; weak acids and weak bases light the lamp dimly and produce a small amount of gas bubbles).
Sub-microscopic: the ionization process of acids or bases in water (strong acids and bases are 100% ionized in water; weak acids and bases are partially ionized in water).
Symbolic: ionization reaction equations of strong acids and strong bases in water; ionization reaction equations of weak acids and weak bases in water (forward and reverse reactions); acid ionization constant (Ka); base ionization constant (Kb).

Identify Defined Constructs
This stage involves creating a test instrument grid that includes the KD, IPK, and item indicators containing chemical multiple representation elements, and choosing the form of the questions to be developed. The item indicators are designed based on the main IPK, which have been analyzed for their chemical multiple representation aspects. There are 24 item indicators based on the IPK analysis, and the form of question chosen is an essay test. The essay test was chosen because it avoids the possibility of students guessing answers and because essay questions require students to express their own ideas, so that their cognitive understanding of the aspects of chemical multirepresentation can be tested in more depth. In addition, assessment in the form of essays (descriptions) can produce higher item information function values (Diputera, 2019).

Define Outcome Space of Items
The design of the items and assessment rubrics is based on the item indicators of the test instrument and on the levels of understanding of chemical multirepresentation, which is the key point in building this instrument. The analysis of the grid produced 10 question discourses with 24 sub-question items, each related to the macroscopic, sub-microscopic, or symbolic level. One set of the test instrument consists of question discourses, sub-questions, and assessment rubrics. The assessment rubric contains the answer keys, the level of understanding of chemical multirepresentation achieved by students, and scoring guidelines. Table 2 describes the levels of understanding of chemical multirepresentation.

Students are able to answer questions at the level of macroscopic and symbolic understanding.
Level 2: understanding sub-microscopic with symbolic (knowing the shape of the structure of atoms, molecules, and ions; writing the symbols of the reaction equation correctly). Students are able to answer questions at the level of sub-microscopic and symbolic understanding.
Level 3: understanding and interpreting macroscopic, sub-microscopic, and symbolic with macroscopic (identifying colors and the phenomena that occur, knowing the shape of the structure of atoms, molecules, and ions, and being able to write the symbols of reaction equations correctly; stating and explaining macroscopic phenomena and processes from the sub-microscopic level). Students are able to answer questions at the level of macroscopic, sub-microscopic, and symbolic understanding, or of macroscopic and sub-microscopic understanding.
(Wang et al., 2017)

Each question has interconnected levels of chemical representation in its sub-items. Figure 1 is an example of a question design that tests understanding at the sub-microscopic and symbolic levels. The IPK addressed is explaining the concept of Arrhenius acids and bases. Sub-item 1a tests conceptual understanding at the sub-microscopic level: students are required to explain the Arrhenius acid theory based on the given sub-microscopic illustration of the acid ionization process. Sub-item 1b requires students to write down the equation for the ionization reaction of an Arrhenius acid, which is knowledge at the symbolic level. Through these two sub-items, students' understanding of the Arrhenius acid-base concept in particular can be tested comprehensively.

Furthermore, the instrument was validated by six validators who were subject matter experts. Content validity uses a Guttman scale with a choice of "Yes" or "No" for each assessment aspect. The validity data were analyzed using the Rasch model with the minifacet software program. The analysis of content validity by the experts can be seen in Figure 2. Figure 2 is a Wright map that shows the content validity data from the experts' assessment against nine aspect criteria. The map has four columns with different information. The first column is the measure column, which shows the logit scale with values ranging from -5 to +3. The second column describes the quality of the items according to the validators, read in order from top to bottom. The top items have the best quality because they meet the most criteria in the assessment aspects. The third column gives the codes of the assessment aspect criteria in order of difficulty. The top aspect is the most difficult for the items to achieve, while the bottom aspects are the easiest for the items to achieve according to the validators' assessment. The fourth column is the validator column, which shows the order of the validators' assessments.

Figure 2. Wright Map of Expert and Items
Based on the content validity analysis of the 10 essay questions using the Rasch model, two questions of the best quality were obtained in the top position, namely items 3 and 4. Items 3 and 4 meet all the criteria of the assessment aspects because their position on the logit scale is higher than that of all aspects. In the aspect criteria column, several criteria are positioned higher than some items, which means that these items have not met all the criteria of the assessment aspects. The map also shows the most difficult criterion to achieve, namely criterion "K7". In addition, four criteria occupy the lowest position on the logit scale, which means that they are the easiest for all questions to achieve, namely criteria K1, K4, K5, and K9.
The strata value obtained from the experts' assessment is 3.37, which indicates that the validators' assessment meets the reliability criterion. Meanwhile, the validator reliability value obtained was 0.84, in the very good category. This reliability value indicates the reliability of the experts in assessing the items (Nisa & Yusmaita, 2022). The third column gives the exact agreement value of 88.5%, which is close to the expected agreement (model estimate) of 88.8%. This means that there is a fit between the results of the expert assessment and the model estimate (Eliza & Yusmaita, 2021). Table 3 shows a summary of the results of the expert assessment analysis.
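To make the idea of exact agreement more concrete, the sketch below counts how often pairs of validators give the same "Yes"/"No" judgement on the same item-criterion combination. It is a simplified illustration only: the function name `exact_agreement` and the ratings are hypothetical, and the expected agreement reported alongside it comes from the fitted model, which is not reproduced here.

```python
from itertools import combinations


def exact_agreement(ratings):
    """Percentage of pairwise matches between raters.

    ratings: dict mapping a rater label to a list of 0/1 judgements
    (1 = "Yes", 0 = "No"), all lists in the same item-criterion order.
    """
    if len(ratings) < 2:
        raise ValueError("at least two raters are needed")
    lengths = {len(values) for values in ratings.values()}
    if len(lengths) != 1 or lengths == {0}:
        raise ValueError("all raters must judge the same non-empty set")

    agree = 0
    total = 0
    for rater_a, rater_b in combinations(ratings, 2):
        for x, y in zip(ratings[rater_a], ratings[rater_b]):
            total += 1
            agree += int(x == y)
    return 100.0 * agree / total


# Hypothetical judgements from three validators on four item-criterion pairs.
example = {
    "V1": [1, 1, 0, 1],
    "V2": [1, 1, 1, 1],
    "V3": [1, 0, 0, 1],
}
print(f"exact agreement = {exact_agreement(example):.1f}%")
```

When the observed percentage is close to the percentage that the model expects, the validators are behaving about as consistently as the model predicts.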

Conduct Pilot Testing
The validated product was tested on 30 12th-grade students of high school 3 Padang who had learned acid-base material. According to Sumintono and Widhiarso (2013), this sample size meets the minimum requirement for testing an essay test using the Rasch model. To avoid highly biased scores, students were first taught the acid-base material again before the test so that they could recall it. Once the students were ready, they were asked to answer the 24 essay sub-questions individually within 60 minutes. The students' answers were corrected and scored based on the multirepresentation understanding reference and the scoring guidelines that had been created.

Analyze Data using Rasch Model
Students' score data were analyzed using the Rasch model with the help of the ministep software. The item qualities reviewed were validity, reliability, difficulty index, and item discrimination.

Validity of Items
Item validity can be determined from the item fit order output using three fit criteria: Outfit Mean Square (MNSQ), Outfit Z-Standard (ZSTD), and Point Measure Correlation (Pt-Mean Corr). According to Planinic et al. (2019), not all three criteria must reach the "accepted" value for an item to be considered "valid"; an item that meets only one criterion can still be considered valid. In this analysis, however, no item meets only one criterion: every item meets at least 2 of the 3 fit criteria. The most difficult criterion to meet was Pt-Mean Corr. Although 15 items fall below the acceptable limit for this criterion, these items still meet the Outfit MNSQ and Outfit ZSTD criteria. Table 4 is the item fit order, which shows the results of the validity analysis according to the Rasch model.
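The fit screening described above can be sketched as a small check against the three acceptance ranges given in the methodology (MNSQ 0.5 to 1.5, ZSTD -2.0 to +2.0, Pt-Mean Corr 0.4 to 0.85). The class and function names are hypothetical, the bounds are assumed to be inclusive, and the item values are invented for illustration rather than taken from Table 4.

```python
from dataclasses import dataclass

# Acceptance ranges as stated in the methodology (inclusive bounds assumed).
MNSQ_RANGE = (0.5, 1.5)
ZSTD_RANGE = (-2.0, 2.0)
PT_CORR_RANGE = (0.4, 0.85)


@dataclass
class ItemFit:
    item: str
    outfit_mnsq: float
    outfit_zstd: float
    pt_corr: float


def in_range(value, bounds):
    low, high = bounds
    return low <= value <= high


def criteria_met(fit):
    """Return the names of the fit criteria that the item satisfies."""
    checks = {
        "MNSQ": in_range(fit.outfit_mnsq, MNSQ_RANGE),
        "ZSTD": in_range(fit.outfit_zstd, ZSTD_RANGE),
        "Pt-Mean Corr": in_range(fit.pt_corr, PT_CORR_RANGE),
    }
    return [name for name, ok in checks.items() if ok]


def is_valid(fit, minimum_met=1):
    """An item counts as valid if it meets at least `minimum_met` criteria."""
    return len(criteria_met(fit)) >= minimum_met


# Hypothetical fit statistics for three items.
items = [
    ItemFit("X1", 0.92, -0.3, 0.55),
    ItemFit("X2", 1.20, 0.8, 0.31),
    ItemFit("X3", 1.80, 2.6, 0.20),
]
for fit in items:
    print(fit.item, criteria_met(fit), is_valid(fit, minimum_met=2))
```

Setting `minimum_met=1` corresponds to the rule attributed to Planinic et al. (2019), while `minimum_met=2` corresponds to the stricter level that all items in this study actually reached.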

Reliability of Items
Reliability analysis in the Summary Statistics output of the Rasch model gives a Cronbach's alpha value of 0.83. Cronbach's alpha is used to measure overall reliability by looking at the interaction between persons and items (Widhiarso & Sumintono, 2015). Reliability explains whether an instrument provides the same or consistent information when the measurement is repeated (Pratama, 2020). With a Cronbach's alpha value ≥ 0.8, the reliability of the instrument is in the very good category. In addition, the reliability of the items can be seen from the "item reliability" value. Figure 3 shows the results of the Rasch analysis of item reliability.
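For readers who want to see where a Cronbach's alpha value comes from, the following sketch applies the standard raw-score formula, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores), to a small score matrix. The function name `cronbach_alpha` and the scores are hypothetical; the value of 0.83 reported here was produced by the analysis software, not by this code.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for a persons-by-items score matrix.

    scores: list of rows, one per person, each a list of item scores.
    """
    n_persons = len(scores)
    n_items = len(scores[0]) if scores else 0
    if n_persons < 2 or n_items < 2:
        raise ValueError("need at least two persons and two items")
    if any(len(row) != n_items for row in scores):
        raise ValueError("all rows must have the same number of items")

    # Variance of each item across persons, and of the total scores.
    item_vars = [pvariance([row[j] for row in scores]) for j in range(n_items)]
    total_var = pvariance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance")

    return n_items / (n_items - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical essay scores (0-3) for five students on four sub-questions.
example = [
    [3, 2, 3, 2],
    [2, 2, 1, 1],
    [1, 0, 1, 0],
    [3, 3, 2, 3],
    [0, 1, 0, 1],
]
print(f"alpha = {cronbach_alpha(example):.2f}")
```

Because the same variance definition is used in the numerator and the denominator, using population or sample variance gives the same alpha.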

Item Discrimination
Item discrimination describes how well the items distinguish between individuals of high and low ability. The discriminating power of the items is also analyzed using the Summary Statistics output in Figure 3. The grouping of items by discriminating power can be seen from the separation value. The greater the separation value, the better the discriminating power of the items and respondents, because the instrument can distinguish between groups of items and groups of respondents (Widhiarso & Sumintono, 2015). With a separation value of 1.99, H = [(4 × 1.99) + 1]/3 = 2.98, which is rounded to 3. This means that there are 3 groups of items, namely easy, medium, and difficult.
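As a cross-check on the summary statistics, separation and reliability in Rasch output are tied together by the standard relation reliability = separation² / (1 + separation²). The sketch below applies it under the assumption that the separation of 1.99 quoted above is the item separation from the same summary table; the function names are hypothetical.

```python
import math


def reliability_from_separation(separation: float) -> float:
    """Reliability implied by a separation index: G^2 / (1 + G^2)."""
    if separation < 0:
        raise ValueError("separation must be non-negative")
    return separation ** 2 / (1.0 + separation ** 2)


def separation_from_reliability(reliability: float) -> float:
    """Inverse relation: G = sqrt(R / (1 - R)), defined for 0 <= R < 1."""
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must be in [0, 1)")
    return math.sqrt(reliability / (1.0 - reliability))


g = 1.99  # separation value quoted in the text
print(f"implied reliability = {reliability_from_separation(g):.2f}")
```

Under that assumption, a separation of 1.99 implies a reliability of about 0.80, which is in line with the item reliability value reported for this instrument.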

Difficulty Index
The difficulty index can be analyzed from the Item Measure output, which is presented in Figure 4. The item difficulty index is obtained from a combination of the mean logit value and the standard deviation (SD). The mean logit value is 0.00 and the SD is 1.16, so the mean plus one SD is +1.16 logit and the mean plus two SD is +2.32 logit. Figure 4 shows the groups of the item difficulty index. Based on the item measure analysis, one outlier item was obtained, item 7a, whose logit value of +2.65 exceeds the +2.32 logit limit. No other outlier items were found, so the items can be grouped by difficulty index into easy, medium, difficult, and very difficult items (Palimbong et al., 2018). Figure 5 is the item measure output, which shows the difficulty index analysis by the Rasch model.
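The grouping by mean and standard deviation can be sketched as a simple threshold rule. The outlier limit of mean + 2 SD (+2.32 logit) and the easy limit of mean - 1 SD (-1.16 logit) follow the values used in this study; placing the boundary between medium and difficult at mean + 1 SD is an assumption made for this sketch, and the function name `difficulty_group` is hypothetical.

```python
def difficulty_group(measure, mean=0.0, sd=1.16):
    """Classify an item logit measure relative to the mean and SD.

    Cut-offs: above mean + 2 SD -> very difficult (outlier);
    above mean + 1 SD -> difficult (assumed boundary);
    from mean - 1 SD up to mean + 1 SD -> medium; below mean - 1 SD -> easy.
    """
    if measure > mean + 2 * sd:
        return "very difficult (outlier)"
    if measure > mean + sd:
        return "difficult"
    if measure >= mean - sd:
        return "medium"
    return "easy"


# +2.65 is the measure reported for item 7a; the other values are hypothetical.
for label, logit in [("7a", 2.65), ("hypothetical A", 1.50),
                     ("hypothetical B", 0.10), ("hypothetical C", -2.00)]:
    print(f"{label}: {logit:+.2f} logit -> {difficulty_group(logit)}")
```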

Review Item Compatibility
A review of the items is carried out based on the analysis of the four criteria in the previous stage. In the validity analysis, all items fit the model. In the reliability analysis, the item reliability value of 0.8 means that item reliability is good. The difficulty index analysis shows that there are easy, medium, and difficult items, and one very difficult item. The item discrimination distinguishes three groups, which is considered appropriate and good according to the model.

Review Wright Map
Wright map analysis provides comprehensive information about the distribution of students' abilities and the difficulty levels of the items on the same scale. The left side of the Wright map shows the distribution of students' abilities, and the right side shows the distribution of item difficulty. Two students, L05 and P18, are in the highest position with logit values > +2, which means that they have the highest ability. The student with the lowest ability, in the lowest position, is P12, with a logit value of 0. The most difficult item is item 7a, because it is in the top position and outside the T (outlier) limit. The items with the lowest logit values (≤ -2 logit) are items 3b and 2b, but they are not outliers. This means that more students answered these items correctly (Sabekti & Khoirunnisa, 2018). Figure 6 shows the Wright map distribution of students and items.

Figure 6. Wright Map of Students and Items
Based on the Wright map analysis, one item, item 7a, lies outside the T limit. Item 7a assesses understanding at the macroscopic level. The items classified as difficult include items 8a and 4b. Item 8a tests sub-microscopic understanding, and item 4b tests symbolic understanding. The relatively easy items are items 2b, 3b, and 6a, with logit values below -1.16. Items 2b and 3b test symbolic understanding, and item 6a tests macroscopic understanding. This is indicated by the number of students who answered incorrectly: only one subject on item 2b, only four subjects on item 3a, and only one subject on item 6a, while a small number of students answered incompletely and most answered correctly. Most of the items were considered fit and had a good and even distribution of difficulty levels. However, one item at the very top, item 7a (outlier), should be considered for revision and re-testing. Even so, the distribution of the items displayed on the Wright map is mostly in the good category, so the deviation of one item (outside the T limit) on the Wright map can still be tolerated.

Repeat Steps 4-7 until All Items Fit
The expected quality of the instrument has been achieved, as shown by the results of the validity, reliability, difficulty index, and item discrimination analyses, which are good and in accordance with the model. However, the Wright map analysis shows one item outside the T (outlier) limit, namely item 7a. Item 7a is still considered suitable for use because, at +2.65 logit, it is not too far from the neighboring items and persons. Although this item should be considered for revision and re-testing, a second trial was not carried out due to limited research time.

Establish Test Instrument Quality Claims
The validity of all items is detailed in Table 4. All items can be claimed to be valid because they meet at least 2 of the 3 criteria for MNSQ, ZSTD, and Pt-Mean Corr in the Item Fit Order table. The test instrument can be claimed to be reliable based on the summary statistics in Figure 3, with an item reliability value of 0.80 and a Cronbach's alpha of 0.83, in the very good category. Likewise, the difficulty index and item discrimination show 3 levels of questions: easy, medium, and difficult.

Develop Test Instrument Documentation
Test instrument documentation needs to be developed to provide information for teachers and students in using the instrument, so that more complete information is available about the characteristics of the instrument being developed (Sabekti & Khoirunnisa, 2018). The documents required for this test instrument are the learning progression, the test instrument grid, the items (including covers and general instructions), the assessment rubrics, and the guidelines for students' levels of understanding in macroscopic, sub-microscopic, and symbolic representations.

Conclusion
Based on the results of this study, it can be concluded that the instrument developed using the Rasch model for testing high school students' understanding of acid-base material at the macroscopic, sub-microscopic, and symbolic levels meets the criteria of being valid and reliable, with a good difficulty index and good item discrimination, and that there is a fit between the experts' validity agreement and the model estimate.