Sampling and randomisation in experimental and quasi-experimental CALL studies: Issues and recommendations for design, reporting, review, and interpretation

The majority of research papers in computer-assisted language learning (CALL) report on primarily quantitative studies measuring the effectiveness of pedagogical interventions in relation to language learning outcomes. These studies are frequently referred to in the literature as experiments, although this designation is often incorrect because of the approach to sampling that has been used. This methodological discussion paper provides a broad overview of the current CALL literature, examining reported trends in the field that relate to experimental research and the recommendations made for improving practice. It finds that little attention is given to sampling, even in review articles. This indicates that sampling problems are widespread and that there may be limited awareness of the role of formal sampling procedures in experimental reasoning. The paper then reviews the roles of two key aspects of sampling in experiments: random selection of participants and random assignation of participants to control and experimental conditions. The corresponding differences between experimental and quasi-experimental studies are discussed, along with the implications for interpreting a study’s results. Acknowledging that genuine experimental sampling procedures will not be possible for many CALL researchers, the final section of the paper presents practical recommendations for improved design, reporting, review, and interpretation of quasi-experimental studies in the field.

Research Article. ReCALL, Volume 36, Issue 1, January 2024, pp. 58–71.

This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.

© The Author(s), 2023. Published by Cambridge University Press on behalf of EUROCALL, the European Association for Computer-Assisted Language Learning

1. Introduction

Much like the use of randomised clinical/control trials in medical contexts, experimental studies are a high-status research genre within computer-assisted language learning (CALL) (Gillespie, 2020; Golonka, Bowles, Frank, Richardson & Freynik, 2014; Macaro, Handley & Walter, 2012). However, the distinction between experimental and quasi-experimental studies is not consistently recognised or understood within CALL literature, as many quasi-experimental studies are mislabelled as experimental studies (Section 2). The APA Dictionary of Statistics and Research Methods defines an experiment as

a series of observations conducted under controlled conditions to study a relationship with the purpose of drawing causal inferences about that relationship. An experiment involves the manipulation of an independent variable, the measurement of a dependent variable, and the exposure of various participants to one or more of the conditions being studied. Random selection of participants and their random assignment to conditions also are necessary in experiments [emphasis added]. (Zedeck, 2014: 397)

In contrast, it describes a quasi-experimental design as an “experimental design in which assignment of participants to an experimental group or to a control group cannot be made at random” and notes that “this is usually the case in field research” (p. 872). Experiment-like designs that do not randomly select participants can also be considered quasi-experimental designs. This distinction between experimental and quasi-experimental designs is important when we reflect upon the logic of experimental reasoning. Randomised selection and assignment have major implications for probabilistic generalisation from a study’s results. This is the case irrespective of any inferential statistics used, as the legitimacy of such analyses is contingent upon study design. As Gass, Loewen and Plonsky (2021) observe with reference to applied linguistics in general, accepting weak study designs and correspondingly dubious interpretations of findings is a threat to the credibility of the field. CALL, just like other fields of applied linguistics, needs to take such appraisals seriously. Consequently, it is essential that CALL researchers, reviewers and readers understand the difference between experimental and quasi-experimental studies.

This methodological discussion paper begins with an overview of common practice within the field based on recent review papers. It then outlines the role of randomisation in sampling, both assignation of participants to conditions and selection of the sample in the first place, clarifying the distinction between experimental and quasi-experimental designs. After considering the relationship between these sampling issues and experimental reasoning, the final section of the article attempts to reconcile the logic of experimental reasoning with the practical constraints that researchers work within, providing recommendations for the design, reporting, reviewing, and interpretation of quasi-experimental studies within the field.

2. Literature review: The state of the field

Few review papers report explicitly on the sampling approaches adopted in the studies reviewed. Unusually, Wang and Devitt (2022) do directly discuss the sampling approaches taken in the subset of experiment-like studies that they reviewed. They note that only three papers randomly assigned participants to control and experimental conditions, correctly treating random assignation to conditions as central to experimentation. They report a further nine studies as quasi-experimental because their designs are similar to experiments in the use of pre- and post-tests, but they lack random assignation of test subjects to test conditions. A second review paper that reports on sampling approaches is Boulton and Cobb’s (2017) meta-analysis of corpus use in language learning. Their analysis indicates that only 42% of the papers that reported on the constitution of their control and experimental groups used random assignment to conditions (14 out of 33 papers). More worryingly, for 42 of the 88 experimental groups that they identified, there was no indication at all of how participants were assigned to groups. Together, these review papers suggest that only between 16% and 25% of experiment-like CALL papers use random assignment. These figures also suggest that this problem is more common in CALL than in other areas of applied linguistics, as Plonsky and Gonulal (2015) reported that across subfields, 37% of applied linguistics studies used randomisation.
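The percentages above follow from simple arithmetic. The sketch below shows one possible reading of how the figures arise; it is not a calculation taken from the reviews themselves. It assumes that the 25% figure is the 3 randomised studies out of the 12 experiment-like studies (3 + 9) in Wang and Devitt (2022), and that the 16% figure sets the 14 randomising papers in Boulton and Cobb (2017) against all 88 groups that they identified. The function name is hypothetical.

```python
def percentage(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)

# Wang and Devitt (2022): 3 randomised studies and 9 quasi-experimental studies.
wang_devitt = percentage(3, 3 + 9)

# Boulton and Cobb (2017): 14 of the 33 papers that reported group constitution.
boulton_cobb_reported = percentage(14, 33)

# Assumed reading of the lower figure: the same 14 set against all 88 groups.
boulton_cobb_all = percentage(14, 88)

print(wang_devitt, boulton_cobb_reported, boulton_cobb_all)
```

On this reading, the choice of denominator (only the cases that reported their assignment procedure, or all cases identified) accounts for the gap between the 42% and 16% figures.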

Several other CALL review papers allude to issues around sampling and distribution of test subjects to test conditions but do not report relevant data. For instance, both Lin (2015) and Manca and Ranieri (2016) mention experimental and quasi-experimental studies, although neither states the relative proportions observed. Macaro et al. (2012: 26) report that “in many studies … the basis of assignment to experimental and control conditions was not stated”. In their overview of 350 studies, Golonka et al. (2014: 92) note “convenience samples of existing classes [and] lack of suitable control group” as significant challenges to the generalisability of the body of work reviewed, which suggests that sampling problems are widespread. Similarly, Gillespie (2020: 140) notes in his review of 777 articles that there is a “pattern of researchers being the teachers and using their own students as the subjects of investigation”. At a minimum, this indicates that studies are commonly sampling from uninformatively narrow populations of students, but it seems more likely that Gillespie is actually identifying common use of convenience samples. Peterson et al. (2020) also seem aware of similar issues in the studies that they review, noting that many study results may be the product of research being conducted in particular contexts, presenting limits to their generalisability.

It is striking that few review papers even acknowledge the difference between experimental and quasi-experimental designs in the studies that they review, and that even when this distinction is acknowledged, only two papers then separated these two kinds of studies when reporting their findings (Boulton & Cobb, 2017; Wang & Devitt, 2022). Furthermore, none of these review papers reported on the sampling frames used, although the widespread use of convenience samples and intact classes was a common global evaluation of the literature under review, as indicated above. It is concerning that many review studies do not appear to recognise any distinction between experimental and quasi-experimental studies, referring to both types of design as experiments. Extrapolating from the recommendations for research practice provided in review papers, there seems to be a belief that more studies, in more settings, with more learners, will allow results to be aggregated and thereby strengthen the body of findings as a whole.

It might be tempting to think that statistical meta-analyses will be able to resolve generalisability issues and make sense of a rapidly increasing, but frequently inconsistent, body of research literature. However, this is a misunderstanding of the intended purpose of meta-analysis and its vulnerabilities. Meta-analysis most clearly addresses issues of statistical power (Sheskin, 2020). That is, meta-analysis addresses insufficient sample size, not problems with sampling approach and valid interpretation. Ross and Mackey (2015: 220) make the following observations:

A key advantage of meta-analysis is that it provides the basis for what researchers can expect to find if studies similar to those in the meta-analysis are replicated. When meta-analyses result in a nonzero effect size, further experimentation anticipating a zero-effect outcome, that is, one that establishes significance in relation to a null hypothesis, eventually becomes superfluous.

Meta-analysis assumes a high degree of homogeneity among the studies included in the analysis, and differences between the studies in a meta-analysis constitute a significant methodological challenge. In the Handbook of Parametric and Nonparametric Statistical Procedures, Sheskin (2020: 1725–1764) states that

pooling the results of multiple studies that evaluate the same hypothesis is certainly not a simple and straightforward matter. More often than not there are differences between two or more studies which address the same general hypothesis. Rarely if ever are two studies identical with respect to the details of their methodology, the quality of their design, the soundness of execution, and the target populations that are evaluated. [This is problematic because …] studies should not only exhibit consistency with regard to the direction of their outcome, but more importantly should exhibit consistency with respect to the magnitude of the effect size present in the k studies … [but, realistically,] there are probably going to be differences in methodology and/or the types of subjects employed in some or all of the k studies, and it could be argued that the latter factors could be responsible for yielding different effect sizes.

Measures of variation within the pooled data of a meta-analysis, such as standard deviations and confidence intervals, are essential to interpreting the results of a meta-analysis. The more heterogeneous the studies pooled, the wider such measures will be. Hence, it should not be entirely unexpected when Lin (2015: 276) reports of her meta-analytic review of computer-mediated communication that “there is a vast range in the magnitude of effects, clearly indicating a large standard deviation (SD = 0.65) – larger in fact than the value of the average effect size”. Indeed, very large standard deviations and/or confidence intervals are typical of the results of meta-analyses reporting on the kind of heterogeneous studies found in CALL (e.g. Boulton & Cobb, 2017; Cobb & Boulton, 2015; Lee, Warschauer & Lee, 2019; Zhao, 2004; Ziegler, 2016). Consequently, while we might be confident in the capacity of meta-analyses to indicate whether a certain type of CALL treatment can have a positive effect on learning overall, they are far less likely to identify the conditions under which a particular operationalisation of an intervention can be expected to be effective. Hence, meta-analyses in CALL will typically reveal that a type of treatment (e.g. gamification) can have a wide range of effects, sometimes very effective, sometimes marginally effective, sometimes less effective than the control. However, heterogeneity of studies notwithstanding, even modest meta-analytic claims about general effectiveness should be tempered by the well-recognised file-drawer issue (Sheskin, 2020). That is, studies that are available for inclusion in an analysis will tend to be positively biased because studies with significant results are more likely to be published than studies with non-significant results. There are techniques for estimating the extent of such biases, but the file-drawer issue exacerbates interpretation issues when meta-analyses explore very heterogeneous studies, reporting corresponding variability.

Meta-analytic reviews do have a valuable role to play, but they cannot be expected to solve the sampling issues confronting experimenters. As Plonsky and Ziegler (2016: 29) argue, one of the most important roles is to “describe and evaluate the research and reporting practices of the domains they review”. But the informativeness of meta-analysis results themselves depends on the extent to which genuinely comparable studies are available. In practice, CALL meta-analyses rarely provide a clear indication of the effectiveness of any particular treatment due to major differences between the studies analysed. In the words of Sheskin (2020: 1746),

one should view a combined effect size value with extreme caution when there is reason to believe that the k effect sizes employed in determining it are not homogeneous. Certainly, in such a case the computed value for the combined effect size is little more than an average of a group of heterogeneous scores, and not reflective of a consistent effect size across studies.
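The caution about combined effect sizes can be illustrated numerically. The following is a minimal sketch, assuming a fixed-effect (inverse-variance) pooling model together with two standard heterogeneity statistics, Cochran's Q and I², none of which are named in the sources quoted above. All effect sizes and variances are invented for illustration, and the function names are hypothetical.

```python
from math import fsum

def pooled_effect(effects, variances):
    """Fixed-effect pooled estimate: inverse-variance weighted mean."""
    weights = [1 / v for v in variances]
    return fsum(w * d for w, d in zip(weights, effects)) / fsum(weights)

def cochran_q(effects, variances):
    """Weighted sum of squared deviations from the pooled estimate."""
    weights = [1 / v for v in variances]
    centre = pooled_effect(effects, variances)
    return fsum(w * (d - centre) ** 2 for w, d in zip(weights, effects))

def i_squared(effects, variances):
    """Percentage of variation attributed to heterogeneity rather than chance."""
    q = cochran_q(effects, variances)
    df = len(effects) - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100

# Invented data: four studies per set, each with the same sampling variance.
variances = [0.04, 0.04, 0.04, 0.04]
consistent = [0.50, 0.55, 0.45, 0.50]     # similar effects across studies
inconsistent = [-0.30, 0.20, 0.90, 1.60]  # effects differ widely

for label, effects in (("consistent", consistent), ("inconsistent", inconsistent)):
    print(label,
          round(pooled_effect(effects, variances), 2),
          round(i_squared(effects, variances), 1))
```

In this invented example the two sets have similar pooled values (roughly 0.5 and 0.6), yet only the first reflects a consistent effect; in the second, the pooled value is an average of studies ranging from a negative effect to a very large positive one.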

3. Experiments, sampling, and random assignation

Experimentation is concerned with observing the relationship between manipulated independent variables (typically pedagogical interventions in CALL) and dependent variables (for instance, some kind of learning outcome). The aim of experiment design is to create conditions in which sending a signal via manipulation of the independent variables allows the researcher to clearly observe the response in the dependent variables. Variation in the observations that is not caused by manipulation of the independent variable is considered background noise, obscuring observation of the signal–response relationship. For this reason, other sources of variation are also referred to as nuisance variables (Winer, Brown & Michels, 1991). We can think of an experiment design as trying to control, or neutralise, noise from nuisance variables so that a signal–response relationship between the independent and dependent variables can be observed.

Nuisance variables can make it impossible to tell whether changes in the dependent variable resulted from changes in the independent variable or were due to variation in the nuisance variables. For example, if I am interested in using a new technology for language learning, but the motivation levels of my students vary, motivation would be a nuisance variable, and my study would want to ensure that the effect of motivation does not obscure observation of any differences in learning outcomes that may result from the pedagogical intervention. If I use the new technology with a group of highly motivated language learners, and use an old technology with a group of poorly motivated language learners, it is not clear whether the independent variable or the nuisance variable accounts for differences between the observations. The nuisance variable, motivation, is obscuring observation of the variable that the experiment sought to explore. When nuisance variables may have obscured our observations of the signal–response relationship between the independent and dependent variable, they are said to have presented a confound, invalidating the results of the experiment.

Major nuisance variables can be controlled through various aspects of experiment design, such as blocking (a topic beyond the scope of this paper). However, it is widely recognised that it is not practically possible to control all possible nuisance variables (Shadish, Cook & Campbell, 2001; Winer et al., 1991). Fortunately, randomisation provides a means of neutralising the detrimental effects of large numbers of nuisance variables:

The principle in this latter case is that all variables not controlled experimentally or statistically should be allowed to vary completely at random … The outcome is that over large numbers of subjects the unique characteristics of subjects which are not controlled are distributed evenly over the treatment conditions, the primary purpose of this being to remove bias from the estimates of treatment effects. (Winer et al., 1991: 8)

This passage explains why sufficiently large sample sizes are needed, but its main point is that randomisation is the researcher’s best hope that noise from nuisance variables is unbiased, and so does not present a confound. In other words, with a sufficiently large sample, “Random assignment creates two or more groups of units that are probabilistically similar to each other on the average” (Shadish et al., 2001: 13). With sufficiently large samples, random assignation of participants to control and experimental treatments should result in nuisance variables being evenly distributed across experimental and control conditions.
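The dependence of this balancing effect on sample size can be shown with a small simulation. The sketch below is illustrative only: it assumes a single nuisance variable (called motivation here, following the earlier example) that is normally distributed in standardised units, and it uses invented sample sizes and a hypothetical function name. It repeatedly assigns participants to two conditions at random and records how far apart the two groups end up on the nuisance variable.

```python
import random
from statistics import mean

def motivation_gap(n_participants: int, rng: random.Random) -> float:
    """Randomly split participants into two conditions and return the
    absolute difference between the groups' mean motivation scores."""
    motivation = [rng.gauss(0, 1) for _ in range(n_participants)]
    rng.shuffle(motivation)  # random assignment: order decides the condition
    half = n_participants // 2
    treatment, control = motivation[:half], motivation[half:]
    return abs(mean(treatment) - mean(control))

rng = random.Random(2024)  # fixed seed so the illustration is repeatable
average_gap = {}
for n in (20, 200, 2000):
    # Average the imbalance over 200 simulated studies of each size.
    average_gap[n] = mean(motivation_gap(n, rng) for _ in range(200))
    print(n, round(average_gap[n], 3))
```

The average imbalance shrinks as the number of randomly assigned units grows, which is the sense in which the groups are "probabilistically similar" only with sufficiently large samples. With 20 participants, sizeable differences between groups remain common even under random assignment.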

In relation to using intact classes in studies (i.e. when an entire class is assigned to a condition), the experimental unit is the class, not the individuals in that class. This sampling approach can be referred to as cluster sampling. Here, each class assigned to a condition should be treated as one experimental unit (i.e. as one participant). Power analysis provides the most defensible indication of the number of experimental units a study should use (Faul, Erdfelder, Lang & Buchner, 2007). A study using cluster sampling will need as many intact classes as the power analysis indicates. So, for instance, if a power analysis indicated 30 participants per condition were sufficient to achieve desirable statistical power, the experiment could proceed with 30 intact classes per condition under the same logic.
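To give a sense of how a power analysis yields a number of experimental units, the sketch below uses the common normal approximation for a two-group comparison of means. This is an assumption made for illustration and is not the procedure of any particular power analysis package; exact t-based calculations typically give slightly larger numbers. The function name, the default alpha of 0.05 and the power of 0.80 are likewise illustrative choices.

```python
from math import ceil
from statistics import NormalDist

def units_per_condition(effect_size: float,
                        alpha: float = 0.05,
                        power: float = 0.80) -> int:
    """Approximate number of experimental units needed in each of two
    conditions to detect a standardised mean difference (two-sided test)."""
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_power = standard_normal.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)

# A medium standardised effect (0.5) and a large one (0.8).
print(units_per_condition(0.5))
print(units_per_condition(0.8))
```

The number returned counts experimental units. When individuals are randomly assigned, it is a number of participants; when intact classes are assigned, it is a number of classes per condition.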

To illustrate the problem with using intact classes but analysing individual participant results, imagine two intact classes of students: they might have two different teachers, be taught at different times of day, have better or worse classrooms, different timetables – there are numerous variables within this context that might have an effect on the results observed. If the individual students are not assigned to conditions randomly, the potential nuisance variables are systematically distributed across conditions, representing a confound in the design. Hence, intact classes must be treated as individual observations from a design perspective. Studies with only two intact classes have only two experimental unit observations, which is clearly insufficient for random assignment to deliver probabilistic equivalence. This is a major reason for distinguishing between experimental and quasi-experimental designs:
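The point about units of observation can be made concrete with a few lines of code. In the sketch below, the class labels, conditions and scores are all invented, and the function name is hypothetical. Student scores are collapsed into one observation per intact class, which is the level at which assignment to conditions took place.

```python
from statistics import mean

# Invented post-test scores: (class label, condition, student score).
records = [
    ("class_A", "treatment", 72), ("class_A", "treatment", 68),
    ("class_A", "treatment", 75),
    ("class_B", "control", 64), ("class_B", "control", 70),
    ("class_B", "control", 61),
]

def class_level_observations(rows):
    """Collapse student scores into one mean score per intact class."""
    scores_by_class = {}
    for class_label, condition, score in rows:
        scores_by_class.setdefault((class_label, condition), []).append(score)
    return {key: mean(scores) for key, scores in scores_by_class.items()}

observations = class_level_observations(records)

# Count how many experimental units each condition actually contains.
units_by_condition = {}
for _, condition in observations:
    units_by_condition[condition] = units_by_condition.get(condition, 0) + 1

print(observations)
print(units_by_condition)
```

However many students are tested, this design yields one unit per condition, so any difference between the two class means cannot be separated from differences in teacher, timetable or classroom.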

Quasi-experimental control groups may differ from the treatment condition in many systematic (non-random) ways other than the presence of the treatment. Many of these ways could be alternative explanations for the observed effect, and so researchers have to worry about ruling them out in order to get a more valid estimate of the treatment effect. By contrast, with random assignment the researcher does not have to think as much about all these alternative explanations. If correctly done, random assignment makes most of the alternatives less likely as causes of the observed treatment effect at the start of the study. (Shadish et al., 2001: 14)

Random assignation at the level of experimental units is a hallmark of experimentation because it provides assurance that noise from nuisance variables is unbiased and affects both treatments equally.

Shadish et al. (2001) are, unsurprisingly, sympathetic to quasi-experimental designs, but they caution that quasi-experimental designs are fundamentally falsificationist. That is, reasonable interpretation of quasi-experimental results requires a programme of experimentation to systematically examine and dismiss alternative explanations derived from plausible nuisance variables. Without such an examination, the independent variable is just one of many possible explanations for the results observed. Shadish et al. (2001: 14–15) note that

in quasi-experiments, the researcher has to enumerate alternative explanations one by one, decide which are plausible, and then use logic, design, and measurement to assess whether each one is operating in a way that might explain any observed effect. The difficulties are that these alternative explanations are never completely enumerable in advance, that some of them are particular to the context being studied, and that the methods needed to eliminate them from contention will vary from alternative to alternative and from study to study … Obviously, as the number of plausible alternative explanations increases, the design of the quasi-experiment becomes more intellectually demanding and complex – especially because we are never certain we have identified all the alternative explanations. The efforts of the quasi-experimenter start to look like attempts to bandage a wound that would have been less severe if random assignment had been used initially.

Clearly, designing and performing an effective programme of quasi-experimentation is extremely challenging in the context of CALL because of the vast range of factors we know to affect pedagogical outcomes. The commonly reported practice of using intact classes in CALL studies greatly reduces the interpretability of the findings that such studies report. The quality and interpretability of CALL studies would be greatly improved by using genuinely experimental designs that randomly assign participants to treatments. As Ryan (2007: 6) states, “Randomization should be used whenever possible and practical so as to eliminate or at least reduce the possibility of confounding effects that could render an experiment practically useless”. However, before turning to ways of mitigating such issues in quasi-experimental designs (Section 5), it is important to consider the second requirement of experimental sampling: participant selection from sampling frames.

4. Experiments, sampling, and random selection

Thinking about the logic of experimental reasoning, interpreting experiment results as generalisable to a wider population beyond the test subjects depends on the relationship between the sample of test subjects and the population they were drawn from. Under the heading Sampling and causal generalization, Shadish et al. (2001: 23) note that random selection of participants provides the strongest rationale for ensuring logical connections between an experiment’s result and claims to generalisability. They distinguish random assignment of participants to treatment conditions from random selection of participants in the first place. However, both are important and derived from the same core rationale: the need to neutralise the effect of unknown sources of variance.

Random selection of participants addresses questions about whether different results would be obtained by conducting the experiment with different participants. If participants are chosen at random from the whole population that an experimenter wishes to generalise to, randomisation provides a means of evenly distributing noise derived from differences among individuals in the population. That is, while random assignation of participants to conditions addresses biases derived from which participants were observed in which treatment conditions, random selection of participants from a population addresses potential biases derived from which participants were included in the study.

To illustrate, imagine a study that involves two classes of postgraduate students from a top university, taught by the researcher, in a particular country where graduation is impossible without passing a specified language exam. Regardless of whether participants are randomly assigned to treatment conditions, it is reasonable to question whether a study would observe a different pattern of results if conducted with participants of a different age, at a different education level, taught by a different teacher, in a different country, at a different institution, with a different relationship to high-stakes language assessments, and so forth. It is random selection of participants from the whole population that the experimenter wishes to generalise to that provides the logic for generalisation by promising to distribute differences between included and excluded participants evenly. The practical issues with such an approach to experimentation are well recognised, but this does not change the facts: “mere membership in the sample is not sufficient for accurately representing a population” (Shadish et al., 2001: 472). Rather, it is “a researcher who randomly samples experimental participants from a … population [who] may generalize (probabilistically) from the sample to all the other unstudied members of that same population” (Shadish et al., 2001: 22). This is a major practical and methodological barrier to conducting experimental research in CALL.

In terms of a practical barrier, random selection of participants has profound implications for the logistics of actually conducting experimental research on any broadly defined population. For instance, a study wanting to generalise to EFL learners would need to randomly sample from all EFL learners around the world. It is not clear how such a study could be conducted by an ordinary CALL researcher; it would be a challenging undertaking even for a large, well-funded international research group. However, the practical challenges also point to the more abstract methodological barrier.

Logically, we can only randomly select participants if we have a sampling frame: a list or some other representation of the population that facilitates selecting a sample in “such a way that (1) all elements of the sample have an equal and constant chance of being drawn on all draws and (2) all possible samples have an equal (or fixed and determinable) chance of being drawn, [as then] the resulting sample is a random sample from the specified population” (Winer et al., 1991: 13). However, for many populations of language learners, it is not obvious how a sampling frame could be created. For instance, education ministries might be able to provide lists of students studying particular languages in state schools. Such a list would be suitable for delineating those populations of learners, in that country, studying those languages, at state schools, and so it would be possible to randomly select participants from this population. It does not seem remotely feasible, however, to draw up a list of language learners in some more general sense. In effect, this methodological constraint would seem to preclude experimental research on language learners in a broad, general sense. As Jessen (1978: 160) famously cautioned, “Some very worthwhile investigations are not undertaken at all because of the lack of an apparent frame; others, because of faulty frames, have ended in a disaster or in cloud of doubt”. It may be inconvenient, but the warrant for generalisation is provided by random selection of subjects from a credible sampling frame: a convenience sample provides no defensible basis for making formal, probabilistic generalisations.
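Once a sampling frame exists, drawing a simple random sample from it is mechanically straightforward. The sketch below assumes a hypothetical frame of 5,000 student identifiers, of the kind an education ministry list might supply, and an arbitrary sample size of 120. The identifiers, the function name and the seed are all invented for illustration.

```python
import random

def draw_random_sample(sampling_frame, sample_size, seed=None):
    """Simple random sample without replacement: every element of the
    frame has the same chance of selection, and every possible sample
    of the requested size is equally likely."""
    rng = random.Random(seed)
    return rng.sample(sampling_frame, sample_size)

# Hypothetical sampling frame: one identifier per learner on the list.
frame = [f"student_{i:04d}" for i in range(1, 5001)]

# Recording the seed makes the selection procedure reportable and repeatable.
selected = draw_random_sample(frame, 120, seed=7)
print(selected[:5])
```

The code is the easy part. The difficulty identified above lies in obtaining a credible frame in the first place, since results can only be generalised to the population that the frame actually lists.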

Unsurprisingly, the typical study within CALL employs no reported sampling frame (e.g. Macaro et al., 2012; see also Section 2), and so it is not possible to generalise the results of such studies to any identifiable population. Technically, the results reported in such studies only inform us about the participants observed, a kind of quantitative case study. The application of inferential statistics to such studies, although commonplace, is not hugely informative, because we have little ability to identify the population that inferences can be applied to.

The relationship between sample and population is also the major reason why pre-testing to establish group equivalence is not sufficient, even though such tests should be encouraged as a check on whether random assignation has successfully neutralised nuisance variables within the sample itself. Such tests provide a good indication that the experimental and control group are equivalent on a selected variable, but they cannot assure us that the sample is probabilistically typical of a wider population.

In summary, an understanding of experimental reasoning clearly shows that randomisation is integral to the design of generalisable experimental studies, both randomisation of participants to conditions and randomised selection of participants from a specified population. Without such randomisation, a study is quasi-experimental at best, and results should be interpreted with corresponding caution. That is, although quasi-experimental studies can demonstrate that a particular outcome is possible under certain conditions, they cannot readily delineate the conditions under which such results can be obtained: in effect, they provide a proof of concept, but the absence of randomisation prevents such studies from providing generalisable insights. Studies that use intact classes, snowball sampling or other forms of convenience sampling are quasi-experimental, should be identified as such, and interpreted this way.

5. Ways forward

The position outlined above, although well established in technical research methods literature, is likely to be unpalatable to many CALL researchers. In fact, one reviewer of this paper questioned whether it even makes sense to pursue experimental methods in relation to something as complex as CALL, pointing out that many CALL researchers have rejected a research model positing dependent and independent variables, opting instead to explore learning environments from an ecological perspective in which the effectiveness of factors within an environment are viewed as inseparable and interdependent (e.g. Marek & Wu, 2014). But, sympathy for this position notwithstanding, it does not change the fact that the experimental research paradigm is still the predominant research paradigm within CALL, as shown in the literature review above, and yet few CALL researchers are in a position to undertake experimental research once we recognise the necessity of randomised assignment and selection in experimental research. There are many practical constraints on experiment design in an area as complex as language learning, including very limited funding and resources, codes of ethics and professionalism that would seem to conflict with the implementation of randomisation procedures, and immense pressure to produce a high rate of publication in prestigious journals (Colpaert, 2012). But the gap between experiment design theory and achievable practice does not justify inaccurate or misleading reporting. Rather, it necessitates careful assessment of how research can be best conducted and reported within practical constraints. Because quasi-experimental methods are likely to remain the predominant quantitative paradigm within CALL, this section of the paper presents recommendations for how quasi-experimental studies can be conducted and reported in a way that will enhance their quality and informativeness. Three main recommendations are made for improving the design, reporting and reviewing of quasi-experimental studies:

1. Transparency in reporting and interpreting quasi-experimental studies.
2. Understanding the kind of research questions that basic quasi-experimental designs can address.
3. Triangulating quasi-experimental data with other non-experimental data via
   (a) reporting more measures of participants in relation to a greater range of potential nuisance variables;
   (b) providing a thicker description of the participants and research context;
   (c) combining quasi-experimental methods with qualitative research instruments suited to exploring the potential for nuisance variables to have affected the results obtained from a quasi-experimental study.

The first and most obvious step towards better quality research of this kind in CALL would be for more researchers to acknowledge the distinction between experimental and quasi-experimental research. As discussed above, a real experiment is premised upon specific sampling procedures that provide a warrant for generalising the findings (Shadish et al., 2001; Winer et al., 1991). Even when readers are unaware of the logic that underpins experimental research, there is a common understanding that experiments provide generalisable findings. Consequently, misrepresenting quasi-experimental research as experimental research is hugely problematic. It suggests a level of generalisability to findings that is simply not warranted. Articles claiming to present experimental research should provide a clear statement on three crucial aspects of the design: first, how participants were assigned to experimental conditions or treatments; second, how participants were selected from the population of interest; and third, a description of the sampling frame used. To be considered experimental research, the first two statements need to meet a definition of randomness, as discussed earlier in this article. The third statement is needed to make clear the population to which experimental results are to be generalised. When studies are unable to provide satisfactory statements on these key aspects of experiment design, they should be described as quasi-experimental studies. Furthermore, alongside issues of nomenclature, researchers should also make sure that their interpretation of findings parallels the warrant for generalisability inherent in the design decisions made. When quasi-experimental designs are used, researchers should be correspondingly modest in regard to the generalisability claims that they make.
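To make the three reporting statements concrete, the following minimal Python sketch shows one way that random selection from a sampling frame and random assignment to conditions might be carried out and documented. It is an illustration only, not a procedure prescribed in this paper: the function name `select_and_assign`, the learner identifiers, the frame size, the sample size and the seed are all hypothetical, and a two-condition design with equal-sized groups is assumed.

```python
import random


def select_and_assign(sampling_frame, n_participants, seed=None):
    """Randomly select participants from a frame, then randomly split them
    into two conditions (hypothetical helper, for illustration only)."""
    rng = random.Random(seed)  # a recorded seed makes the procedure reportable
    # Random selection: every member of the frame has an equal chance of selection.
    selected = rng.sample(sampling_frame, n_participants)
    # Random assignment: shuffle the selected participants and split them in half.
    # (rng.sample already returns a random order; the shuffle is kept only to
    # make the assignment step explicit.)
    shuffled = list(selected)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"control": shuffled[:half], "treatment": shuffled[half:]}


# Hypothetical sampling frame: an enumerated list of the population of interest.
frame = [f"learner_{i:03d}" for i in range(200)]
groups = select_and_assign(frame, n_participants=40, seed=2024)
print(len(groups["control"]), len(groups["treatment"]))
```

A study that can report each step of such a procedure (the frame, the selection mechanism and the assignment mechanism) is in a position to provide the three statements described above; a study that starts from intact classes has no equivalent of the selection step and typically no equivalent of the assignment step either.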

In a strict sense, quasi-experimental studies are not generalisable. They provide evidence of whether a treatment can, not whether it does. This is what Shadish et al. (2001) mean when they point out that quasi-experimental designs are fundamentally falsificationist. Without a warrant for generalisation to a population, analysis of a quasi-experimental study simply helps clarify the pattern of results observed in that sample. So, for instance, suppose we questioned whether learners could learn from an online resource as easily as they could learn from a paper-based resource. A quasi-experimental study could attempt to falsify this claim. If such a study shows a difference in results in favour of the online treatment, we have demonstrated that at least some learners in some circumstances can learn more that way. However, for the reasons discussed in this paper, the quasi-experimental design does not entitle the researcher to make any formal claims about whether this relationship exists more generally in some wider population: it cannot inform us about whether this relationship does occur generally, irrespective of the many variables that were not controlled for or neutralised. Nor does it inform us about when we should expect this relationship to hold.

Constraints on formal generalisation from quasi-experimental studies lead to the second recommendation: Researchers should think carefully about which research questions a basic quasi-experimental design can answer, and which of these research questions are genuinely worth asking. Let us consider a typical study in which a treatment (for instance, learning language via a mobile app) is compared against a control or alternative treatment (e.g. paper-based study). Does anyone seriously doubt that it might be possible, for some people, in some circumstances, to learn more using a mobile app than using paper-based resources? If such a proposition is not contested, there seems little justification for attempting to falsify it. Basic quasi-experimental studies have the most value in cases where there is legitimate doubt as to whether a treatment could possibly lead to an effect, or where a treatment is posited to always lead to a particular effect. Even just one study with results counter to such always and never hypotheses can disprove them. Basic quasi-experimental designs can address these kinds of research questions effectively.

To illustrate, Gillespie (2020) remarks positively that studies comparing paper-based with computer-based interventions are becoming increasingly rare, and so it seems highly doubtful that there really is anyone who seriously questions the potential for computer-based methods to be as effective as paper-based methods, in at least some circumstances. We might contrast this with learning vocabulary by studying concordance lines. At some historical point, it probably did seem entirely reasonable to question whether this was even possible. However, once a quasi-experimental study has shown that a particular relationship is possible (e.g. Cobb, 1999), what value is there in repeating this ungeneralisable test? As meta-analyses of corpus-based approaches have consistently shown (Boulton & Cobb, 2017; Cobb & Boulton, 2015; Lee et al., 2019), learners most assuredly can learn vocabulary more successfully when studying concordances than under various alternative or control conditions. But these meta-analyses also indicate that the amount of benefit obtained varies considerably, and that in some cases the approach proves ineffective. Hence, what we are currently unsure of is the generalisability of any particular observation, and to answer this question, experimental designs are needed. So, to summarise, when researchers are not in a position to conduct experimental studies, but they are able to conduct a basic quasi-experimental study, they need to think carefully about the kinds of questions that quasi-experimental studies can answer and which of those questions are of interest to the CALL community.

A third way we might look to overcome the limitations of quasi-experimental designs would be to enhance basic quasi-experimental designs by supplementing them with additional sources of information that can facilitate transferability, or naturalistic generalisation – informal processes of speculation about how the results might relate to other contexts (Duff, 2006). When researchers report that their study was conducted using a convenience sample, such as a snowball sample or using their own classes, their academic integrity should be applauded but, in effect, this just alerts the reader to the fact that randomisation was not used, the study is quasi-experimental, and so the results are not formally generalisable. In essence, quasi-experimental studies are quantitative case studies. While the careful researcher should certainly acknowledge that the generalisability of the findings reported is in doubt – for instance, by alerting the reader that it is unclear whether these results would hold for other groups of learners – such acknowledgement should not entirely preclude a degree of reasonable speculation about how quasi-experimental findings might relate to broader populations. However, if researchers do want to try to promote naturalistic generalisation or transfer from their quasi-experimental findings, then they need to actively facilitate this process for their reader. That is, they need to provide as much information about their participant sample and research context as possible. For every possible nuisance variable, a description or assessment of the participants in relation to that variable strengthens the reader’s confidence in transferring the study results to similar populations. There are three main ways that researchers can do this, and all of them concern better reporting in relation to potential nuisance variables.

One possible means of facilitating transfer would be to report more measures of participants in relation to a greater range of potential nuisance variables. We already see this in some studies that report proficiency measures for participants as a means of demonstrating parity between control and experimental groups. While demonstrating parity in proficiency levels is a welcome design facet because it can reassure us about assignment of participants to conditions in relation to that particular variable, it is probably most valuable in quasi-experimental designs as an aid to naturalistic generalisation. So, for instance, if we know the learners in a study had a very advanced knowledge of the target language, we might see the results as potentially relevant to an advanced class that we teach, but be less confident in seeing those findings as relevant to a class of beginners. There are many instruments available in the broader applied linguistics literature that might help researchers describe their participants more informatively, such as measures of motivation, vocabulary size, strategy use, and so on. This does, however, bring to mind once more Shadish et al.’s (2001: 14–15) observation about the impossibility of systematically addressing all possible variables, making neutralisation via randomisation look appealing once more.
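As an illustration of what reporting such measures might look like in practice, the sketch below computes a standardised mean difference between two groups on each of several baseline measures. This statistic is one common descriptive choice and is an assumption of this example, not a recommendation made in the paper; the function name, the measure names and all scores are hypothetical. The sketch also assumes that each group has at least two scores and that the scores are not all identical.

```python
from math import sqrt
from statistics import mean, variance


def standardised_mean_difference(group_a, group_b):
    """Difference in group means divided by the pooled standard deviation
    (hypothetical helper; assumes at least two varying scores per group)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_variance)


# Hypothetical baseline measures: (control scores, treatment scores).
baseline = {
    "proficiency": ([52, 61, 58, 47, 66, 59], [55, 60, 49, 63, 57, 62]),
    "vocabulary_size": ([3100, 2800, 3500, 2950, 3300, 3050],
                        [3000, 3400, 2900, 3200, 3150, 2850]),
    "motivation": ([3.8, 4.1, 3.5, 4.4, 3.9, 4.0], [4.2, 3.7, 4.0, 3.6, 4.3, 3.9]),
}

for measure, (control, treatment) in baseline.items():
    smd = standardised_mean_difference(control, treatment)
    print(f"{measure}: standardised mean difference = {smd:.2f}")
```

Reporting group means, spreads and differences of this kind for each measured variable describes the sample for the reader. It says nothing about variables that were not measured, which is precisely the limitation that randomisation is designed to address.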

A second way that researchers could improve the interpretability of quasi-experimental studies is simply to provide a thicker description of the participants and research context. While it might not be possible to measure every possible nuisance variable and report this formally, there is still value in simply describing the participants and context in a way that helps readers get a clear picture of the who, where, why, what, when and how of the project. It is reasonably common for researchers to report on the gender, age, proficiency level and/or language background of their participants, and this does go some way towards facilitating naturalistic generalisation, but it is far less common to find a description of the educational environment and prevailing attitudes to language education in that context, factors that seem just as likely to have a profound effect on how any pedagogical intervention plays out (The Douglas Fir Group, 2016; Duff, 2006). For instance, if a study is undertaken at a low-status, rather authoritarian, vocational training college where students lack motivation for education in general and teachers complain that students are only interested in passing the exam, this would seem to be very pertinent information for interpreting the study results, as we can imagine that such contextual factors are likely to have a strong influence on the results. Thicker descriptions of both participants and research contexts would go a long way to making quasi-experimental studies more genuinely interpretable and informative for readers.

Finally, going beyond informal but informative description, there is great potential for quasi-experimental research to be combined with qualitative research instruments (Hashemi & Babaii, 2013). The fundamental issue with quasi-experimental designs is that, in the absence of randomisation, nuisance variables are not neutralised, making their influence on the observed results unknown. In contrast, rather than testing the effect of a pre-selected variable, qualitative methods are well suited to exploratory research, asking what factors are operative in an environment. As such, qualitative research instruments have the potential to examine the extent to which nuisance variables may have affected the results obtained from a quasi-experimental study. For instance, researchers could use pre- and post-intervention interviews, observation, or stimulated recall to explore the extent to which unanticipated variables may have influenced the results obtained in the quasi-experimental analysis. This kind of triangulation of research instruments could go a long way to overcoming the most serious shortcomings of quasi-experimental designs. Similar recommendations have been made in the broader context of education research, where Cebula (2018), for instance, connects the prevalence of quasi-experimental designs in education research to the ongoing (un)replicability crisis in psychology and other fields.

6. Conclusion

CALL has developed rapidly as a field, and it has been quick to adopt sophisticated research techniques from more established fields with long traditions of experimentation, but, crucially, some key elements of experimental design and reasoning have been largely ignored. The logic of experiment design necessitates the use of randomisation to control for nuisance variables: random selection (i.e. selection by lot) from a population, and random assignation of participants to control and experimental treatments or conditions. When these conditions are not met, we are actually employing a quasi-experimental design. At a bare minimum, this distinction should be understood and acknowledged. Quasi-experimental designs are never conclusive, even at a theoretical level, and whether similar results would be obtained in different samples affected by different distributions of nuisance variables remains an open question. Given the difficulty of designing and conducting a genuinely experimental study in an area as complex as CALL, it seems very unlikely that true experimental designs will be widely adopted. Instead, it seems almost certain that quasi-experimental designs will continue to be the mainstay of research in our field. Consequently, it is not only essential that researchers acknowledge when the designs that they use are quasi-experimental and moderate their claims for generalisability appropriately, but also important for researchers to adopt research practices that will facilitate transfer and naturalistic generalisation. This is not simply a question of larger samples or more studies in more diverse contexts. Rather, steps should be taken to directly address the shortcomings of quasi-experimental designs.
To such an end, three practices appear promising: first, reporting participant metrics that help to define the sample examined, such as proficiency scores or attitudinal data; second, providing rich, thick description of the participants and context; and third, triangulating quasi-experimental designs with qualitative research instruments designed to explore whether nuisance variables may have significantly affected the results reported in the quasi-experimental portion of the study.

Finally, as one reviewer pointed out, although this paper has focused on research within CALL, it is by no means clear that experiment design practices are superior in other areas of applied linguistics, or even other fields, such as education, psychology, or sociology. The observations presented in this paper regarding experimental and quasi-experimental research and their respective interpretability are relevant to all fields of research, both within applied linguistics and beyond.

Ethical statement and competing interests

The author declares no competing interests.