Why Don't AI Detectors Work?

In the Scholarly Teaching Unit, we have received many requests for recommendations about AI detectors and/or other prevention strategies. While we would love to be able to recommend a panacea for solving the many problematic applications of generative AI in the classroom, there is currently no fool-proof technological solution to detect AI-generated content in student work or to prevent them from using generative AI. As of mid-2024, no detection service has been able to conclusively identify AI-generated content at a rate better than random chance, and Illinois State University does not have a relationship with any of these services. This page is an overview of how detectors purport to work, why they probably can’t work in their current form and some alternatives.

Print Page

How Generative AI Writes

typing text into a phone as the application predicts content

What we know now as generative AI is a form of machine learning that uses Large Language Models (LLMs) to create text in response to a prompt supplied by the user. We have had this technology for decades in R&D contexts and it has been in consumer technology since the late 2000’s in devices like smartphones. If you open your phone right now, there’s a very high chance your keyboard will begin to suggest words and phrases if you start a new text message.

In a very simplistic sense, LLMs are super-charged predictive text generators. They operate one word at a time, finding the most likely word to supply next. While they do not operate solely on the probability of a word appearing in the language they are working in—they are also trained on which outputs are good or bad for a particular situation—probability is at the core of this technology.

LLMs are designed to produce a simulation of “new” content each time they operate, which they achieve through training. The specific training that a model receives provides it with extra context to go beyond the pure probability of a word appearing in the language generally to the probability of a word appearing within a specific genre or responding to a particular prompt.

How AI Detectors Work

Perhaps ironically, AI detectors work by feeding content through an LLM and using other machine learning techniques to analyze text, so they are themselves generative AI tools. There is not full transparency in how these tools are designed to work with any specificity, however, they are trained like any other model: they are given a large corpus of text and told what is AI-generated and what is not, so they can learn to detect it, combined with their own understanding of the language they are trained on.

In machine learning, perplexity is a measurement of how much a piece of text deviates from what an AI model has learned during its training. As Dr. Margaret Mitchell of AI company Hugging Face told Ars, "Perplexity is a function of 'how surprising is this language based on what I've seen?'"

Robot using a chatbot One way that AI detectors operate is by analyzing the uniqueness of each sentence compared to what they expect to find based on their training—what data scientists call perplexity—meaning that the more probable any word is to come after the word before it, the less novel the sentence is, and the more probable it is to be AI-generated, as LLMs work via probability (Edwards 2023).

To explain perplexity, Edwards gives the example of “I’d like a cup of ___.” As you read that, you probably supplied something like “water” or “coffee.” You probably didn’t say “beer” or “wine” which come in glasses. You almost certainly didn’t say “spiders.” “I’d like a cup of water” has low perplexity. “I’d like a cup of wine” has higher perplexity. “I’d like a cup of spiders” has very high perplexity. So, to a detector, “I’d like a cup of spiders” would be unlikely to be flagged as AI, in theory.

A problem with using perplexity as a measure of whether something was written by AI or not, though, is that we use common sentences all the time—that is what makes them common. This is also very true of academic writing, which uses formulae and common phrases consistently both out of necessity and for the sake of transparent communication. This applies to both novice and expert writers, but writers who are less confident or new to a particular genre may lean more into the language their teachers use or language from examples they have found, which makes them more likely to be flagged.

Human writers often exhibit a dynamic writing style, resulting in text with variable sentence lengths and structures. For instance, we might write a long, complex sentence followed by a short, simple one, or we might use a burst of adjectives in one sentence and none in the next. This variability is a natural outcome of human creativity and spontaneity.

Another property that detectors look at is “burstiness,” which is a measure of the consistency in sentence structure and length in a text. We, as humans, do not tend to write sentences of exactly the same length. We’re taught in writing courses that we should vary our sentence structure and length both for rhetorical impact and to keep our writing from being monotonous. Machines don’t always do this, though. Especially in the early days of ChatGPT, sentence structure tended to be pretty consistent and average, because, again, they work on probability. So, “burstiness” could be a way of identifying human-generated content.

This is problematic as well, though, because some human writers are monotonous. Some genres, like memos or policy documents, tend to require stability in sentence structure as well. Perhaps with both perplexity and burstiness considered, a detector could be developed, but looking for averages to determine whether writing is AI-generated or not is very difficult, precisely because we are often “average” ourselves!

Watermarking eases the detection of LLM output text by imprinting specific patterns on them.

One solution offered to the problem of individuals using LLMs to create text and pass it off as their own is watermarking, which is the process of creating a pattern in the word choice and sentence structure used by an LLM so that a detector can more easily identify that text as being AI-generated. Sadasivan et al note that this watermarking can easily be defeated by putting AI-generated text through a second LLM to paraphrase it. Paraphrasing tools also reduce the effectiveness of perplexity and burstiness-based detection methods (2024).

Studies of AI Detectors

Given the speed at which AI is evolving and the number of tools being created to both generate content and detect that generated content, the number of systematic studies available on this topic in a higher education context is currently limited. However, the studies that are emerging have thus far universally found current models of AI detection to be insufficiently accurate for use in academic integrity cases. Indeed, some studies suggest that their accuracy is low enough that they are far more likely to create situations where students are falsely accused than to detect actual instances of academic dishonesty.

Sadasivan et al in their 2024 study, demonstrate a technique called recursive paraphrasing, which was able to defeat detection services operating with several different modalities beyond the ones discussed on this page, including watermarks. They also demonstrate the potential for bad actors using “spoofing attacks” to further degrade the ability of detectors to function (2024).

A study of submissions to a peer-reviewed medical journal by Cooperman and Brandão showed commercially available AI detectors correctly identifying AI-generated content approximately 63% of the time with false positive rates of 24.5% to 25%. Feeding content through GPT 3.5 to paraphrase it reduced the accuracy of detection by 54.83%. They conclude, “we identified a rapid progression in the advancement of both generative AI writing tools and AI detection tools, indicating an urgent need to identify strategies to safeguard our medical literature” (2).

In a recent study by Liu et al, also of medical writing, the combination of AI detection services with human reviewers was found to be effective in identifying both purely AI-generated content and AI-paraphrased content, in pieces that were either wholly generated by GPT 3.5-Turbo or paraphrased in full. While the results they suggest are promising for at least two of the tools they worked with, they note that with the constant evolution of the models and the advent of newer versions such as GPT 4 and GPT 4-o, content detection may increase in difficulty.

A January 2025 study shows that AI detectors remain consistently inconsistent, sometimes getting close to accuracy but then delivering different scores on the exact same files in subsequent checks. In this study, various adversarial techniques (i.e. using a second LLM to paraphrase AI output) were all effective at defeating AI detectors to various degrees.

Spotting AI Content in Student Work

As Generative AI becomes more and more sophisticated, there are fewer hallmarks of its content that can be described with any certainty. However, the basic principles of its operation tend to lead to writing that is big, but empty. It uses language that feels grandiose but doesn’t actually say very much. Unfortunately, this can also be a feature of novice writers’ work.

Phone with open ChatGPT app with prompt to write a poem While newer models are better at avoiding most of these items, you can look for things such as:

Hallucinations: Hallucinations is the word data scientists use to describe fabrications or inventions by GenAI. An archetypal example of a hallucination is a source that does not actually exist. GPT-3.5 would often create this type of hallucination when a user asked for sources, even if there were sources that would be responsive to their query. If a citation looks strange to you, consider looking it up on CrossRef.org
Repetitive Language: Older iterations of GPT tend to rely more on stock phrases, and don’t necessarily take into account the content of previous sentences when creating the next one, so you may see phrases repeated.
Chat-like Dialogue: When responding to a prompt even newer iterations of GPT will say something like “Sure, I can create that essay for you,” before creating the text they were prompted to make. So, if this remains within the text and wasn’t removed, that is a pretty clear sign AI was used.

Alternatives to AI Detection

Our main recommendation about all forms of plagiarism, and not just AI-generated content, boils down to course and assignment design being the most important tools you have in your toolbox to ensure that students are producing their own content. Even as the models evolve, AI cannot engage in self-reflection for our students and it struggles to perform at the higher levels of complexity on Bloom’s Revised Taxonomy. We have suggestions for incorporating or mitigating the use of AI into your course here.

However, we do have some recommendations related specifically to situations where you want to be relatively sure that your students are not using AI. Keep in mind that these strategies are not fool-proof.

Use Version History and Track Changes

Office 365 supports live AutoSave and version history functionality. So, you could create a shared folder for your course and have your students create their work directly in this folder with AutoSave and track changes on. If they write the entire assignment in this folder, you should be able to see the entire version history for the document, which would indicate whether large amounts of text were copied and pasted, and also allow you to see your students’ progress as they write. You could create a class-wide folder, if you wished to encourage peer review, or individual folders if you don’t want them looking at each other’s work. Keep in mind: you should not grade student work in a folder shared with others, so you would need to save a local copy of the file if you wanted to provide feedback.

Consider Presentations or Oral Exams

While AI can now generate audio and visual content, it cannot give a presentation in front of the class. If your class size allows for it, consider replacing essays or other long-form writing assignments with presentations or oral examinations. This won’t always be appropriate for every class, but it is an assessment that is relatively difficult to plagiarize. You might also consider adding an oral component to existing projects by having a conversation with each student on the work they produced—if it is AI-generated, it will likely be difficult for the student to explain their process or the underlying concepts involved.

Other Options

If the above suggestions won’t work in your context, feel free to schedule a consultation with a member of our staff so that we can discuss other options. This section includes strategies that instructors have tried with some success, which may work for you in your course.

ProctorTrack and Lockdown Browser

The University has access to ProctorTrack and the Respondus Lockdown Browser which can be made available by individual faculty request through submitting a Help Desk ticket. Keep in mind that these tools can’t typically be used together, and they can often be difficult for both instructors and students to use in conjunction with Canvas, depending on the specific assessment type being used. There are also challenges with equipment (students must have a camera and a quiet place to work to use ProctorTrack) and accessibility in some cases. Our position is that these tools should be considered a last resort rather than a go-to strategy.

Conclusion

At the present moment, the Scholarly Teaching team is unaware of any generative AI detection service that functions with sufficient accuracy to provide evidence service that has sufficient accuracy to be used in academic integrity cases. We strongly recommend against putting student work into detectors without their consent, due to the services’ own terms of use and intellectual property concerns. Scholarly Teaching staff are not able to discuss individual student cases, but we can discuss ways you might design learning experiences. If you have a concern about a student’s work, we recommend discussing the situation with the staff at the Office of Student Conduct and Community Responsibilities.

Back to AI Homepage

References

AI Detectors Don’t Work. Here’s What to Do Instead. (n.d.). MIT Sloan Teaching & Learning Technologies. Retrieved July 8, 2024, from https://mitsloanedtech.mit.edu/ai/teach/ai-detectors-dont-work/

Cooperman, S. R., & Brandão, R. A. (2024). AI tools vs AI text: Detecting AI-generated writing in foot and ankle surgery. Foot & Ankle Surgery: Techniques, Reports & Cases, 4(1), 100367. https://doi.org/10.1016/j.fastrc.2024.100367

Liu, J. Q. J., Hui, K. T. K., Al Zoubi, F., Zhou, Z. Z. X., Samartzis, D., Yu, C. C. H., Chang, J. R., & Wong, A. Y. L. (2024). The great detectives: Humans versus AI detectors in catching large language model-generated medical writing. International Journal for Educational Integrity, 20(1), 8. https://doi.org/10.1007/s40979-024-00155-6

Sadasivan, V. S., Kumar, A., Balasubramanian, S., Wang, W., & Feizi, S. (2024). Can AI-Generated Text be Reliably Detected? (arXiv:2303.11156). arXiv. http://arxiv.org/abs/2303.11156

Why AI writing detectors don’t work | Ars Technica. (July 14, 2023). Retrieved July 8, 2024, from https://arstechnica.com/information-technology/2023/07/why-ai-detectors-think-the-us-constitution-was-written-by-ai/

Zawacki-Richter, O., Marín, V. I., Bond, M., & Gouverneur, F. (2019). Systematic review of research on artificial intelligence applications in higher education – where are the educators? International Journal of Educational Technology in Higher Education, 16(1), 39. https://doi.org/10.1186/s41239-019-0171-0

Malik, M. A. & Amjad, A. I. (2025). AI vs AI: How effective are Turnitin, ZeroGPT, GPTZero, and Writer AI in detecting text generated by ChatGPT, Perplexity, and Gemini? Journal of Applied Learning & Teaching, 8(1). https://doi.org/10.37074/jalt.2025.8.1.9