In today’s column, I explore and analyze the results of a recent research study that examined the efficacy of using a specially tuned generative AI to perform a limited range of mental health therapy over an eight-week period. Subjects were monitored in a purposely devised experimental setting. The upshot is that the treatment-group participants appeared to benefit from using the tuned generative AI, showing improvements in dealing with mental health conditions such as depression, anxiety, and eating-disorder-related concerns.
This is an encouraging sign that generative AI and large language models (LLMs) can potentially serve as an adequate means of performing mental health therapy. Still, important caveats are worth noting and require further study and consideration.
Let’s talk about it.
This analysis of AI breakthroughs is part of my ongoing Forbes column coverage on the latest in AI, including identifying and explaining various impactful AI complexities (see the link here).
AI And Mental Health Therapy
I’ve been extensively covering and analyzing a myriad of facets of contemporary AI that generates mental health advice and undertakes interactive AI-driven therapy. This rapidly increasing use of AI has principally been spurred by the widespread adoption of generative AI and large language models (LLMs).
There are tremendous upsides to be had, but at the same time, hidden risks and outright gotchas come into these endeavors too. I frequently speak up about these pressing matters, including in an episode of CBS 60 Minutes, see the link here. For a quick summary of some of my posted columns on AI for mental health therapy, see the link here, which recaps forty of the over one hundred column postings that I’ve published on the evolving topic.
Background On AI For Mental Health
Active and extensive research on the use of AI for mental health purposes has been going on for many years.
One of the earliest and most highly visible instances involved the impacts of a rudimentary form of AI known as Eliza during the 1960s, see my discussion at the link here. In that now-classic case, a simple program dubbed Eliza echoed user-entered inputs back to the user, giving the air of an AI acting like a therapist. To some degree, this was a surprise to everyone at the time. The core of the surprise was that a barebones computer program could lead people to seemingly believe they were conversing with a highly capable mental health professional or psychologist.
Almost a dozen years later, in 1975, the legendary astrophysicist and science communicator Carl Sagan made a prediction about the eventuality and inevitability of AI acting as a psychotherapist for humans. As I have discussed regarding his prophecy, at the link here, in many notable ways he was right, but in other facets he was a bit off, and we have not yet witnessed the fullness of his predictions.
During the heyday of expert systems, many efforts were launched to use rules-based capabilities to act as a therapist, see my discussion at the link here. The notion was that it might be feasible to identify all the rules that a human therapist uses to perform therapy and then embed those rules into a knowledge-based system.
The upside of those expert systems was that it was reasonably plausible to test the AI and gauge whether it would dispense proper advice. A builder of such an AI system could exhaustively examine the various paths and rules, doing so to try to ensure that the expert system would not produce improper advice. In AI parlance, this type of AI is considered deterministic.
In contrast, and a disconcerting issue, today’s generative AI and LLMs tend to work on a non-deterministic basis. The AI uses statistics and probabilities to generate the responses emitted to a user. In general, it isn’t feasible to fully test such AI since the outputs are somewhat unpredictable.
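To make the deterministic-versus-non-deterministic distinction more concrete, here is a minimal, purely illustrative Python sketch. The candidate words and probabilities are made up and are not drawn from any real model; the point is simply that sampling from a probability distribution can yield different outputs for identical inputs, which is why exhaustive testing of such outputs becomes infeasible.

```python
import random

# Toy illustration of non-deterministic text generation (hypothetical
# probabilities, not taken from any actual LLM). The model assigns a
# probability to each candidate next word and then samples one of them.
next_word_probs = {
    "anxious": 0.40,
    "stressed": 0.35,
    "overwhelmed": 0.20,
    "fine": 0.05,
}

def sample_next_word(probs):
    # Randomly pick a word in proportion to its assigned probability.
    words = list(probs.keys())
    weights = list(probs.values())
    return random.choices(words, weights=weights, k=1)[0]

# Running the same "prompt" twice can produce different continuations,
# unlike a deterministic rules-based expert system.
print(sample_next_word(next_word_probs))
print(sample_next_word(next_word_probs))
```

A rules-based expert system, by contrast, would follow the same if-then path every time, which is what made exhaustive testing of those earlier systems plausible.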
It is for that reason that we need to be particularly cautious in promoting generative AI and LLMs as a handy aid for performing therapy. People are doing so anyway, often unaware that the AI could give out untoward advice, including producing so-called AI hallucinations that present unsupported, made-up contrivances (see my explanation at the link here).
I’ve repeatedly noted that we are amid a grand experiment that involves the world’s population and the use of generative AI for mental health advisement. This AI-based therapy is being used actively at scale. We don’t know how many people are avidly using LLMs for this purpose, though guesses reach into the many millions of users (see my analysis at the link here).
An intriguing tradeoff is taking place before our very eyes.
On the one hand, having massively available AI-based therapy at a near-zero cost to those using it, available anywhere at any time, might be a godsend for population-level mental health. The qualm is that we don’t yet know whether this will end up as a positive or a negative outcome. A kind of free-for-all is taking place, and seemingly only time will tell whether this unfettered, unfiltered use of AI will have a net positive ROI.
Recent Research Study Takes A Close Look
A recent research study opted to take a close look at how a specially tuned generative AI might perform and did so in a thoughtfully designed experimental setting. We definitely need more such mindfully crafted studies. Much of the prevailing dialogue about this weighty topic is based on speculation and lacks rigor and care in analysis.
In the study entitled “Randomized Trial of a Generative AI Chatbot for Mental Health Treatment”, Michael V. Heinz, Daniel M. Mackin, Brianna M. Trudeau, Sukanya Bhattacharya, Yinzhou Wang, Haley A. Banta, Abi D. Jewett, Abigail J. Salzhauer, Tess Z. Griffin, and Nicholas C. Jacobson, New England Journal of Medicine AI, March 27, 2025, these key points were made (excerpts):
- “We present a randomized controlled trial (RCT) testing an expert–fine-tuned Gen-AI–powered chatbot, Therabot, for mental health treatment.”
- “We conducted a national, randomized controlled trial of adults (N=210) with clinically significant symptoms of major depressive disorder (MDD), generalized anxiety disorder (GAD), or at clinically high risk for feeding and eating disorders (CHR-FED).”
- “Participants were randomly assigned to a 4-week Therabot intervention (N=106) or waitlist control (WLC; N=104). WLC participants received no app access during the study period but gained access after its conclusion (8 weeks).”
- “Critically, as compared with the WLC, Therabot users showed a greater reduction in depression, anxiety, and CHR-FED symptoms at postintervention (4 weeks) and at follow-up (8 weeks).”
- “Secondary outcomes included user engagement, acceptability, and therapeutic alliance (i.e., the collaborative patient and therapist relationship).”
- “Fine-tuned Gen-AI chatbots offer a feasible approach to delivering personalized mental health interventions at scale, although further research with larger clinical samples is needed to confirm their effectiveness and generalizability.”
Readers deeply interested in this topic should consider reading the full study to get the details on the procedures used and the approach that was undertaken.
Mixing And Matching Is Afoot
Some additional historical context on these matters might be beneficial.
There have been a number of prior research studies focusing on principally expert-systems-based AI for mental health therapy, such as a well-known commercial app named Woebot (see my analysis at the link here), a rules-based app named Tessa for eating disorders (see my discussion at the link here), and many others.
Those who have rules-based solutions are often seeking to augment their systems by incorporating generative AI capabilities. This makes sense in that generative AI provides a fluency for interacting with users that conventional expert systems typically lack. The idea is that you might get the best of both worlds, namely the predictable nature of an expert system combined with the highly interactive nature of LLMs.
The challenge is that generative AI tends to have the qualms I mentioned earlier due to its non-deterministic nature. If you blend a tightly tested expert system with a more loosey-goosey generative AI capability, you are potentially taking chances on what the AI is going to do while dispensing mental health advice.
It’s quite a conundrum.
Another angle is to see if generative AI can be bound sufficiently to keep it from going astray. It is conceivable that with various technological guardrails and human oversight, an LLM for mental health use can be reliably utilized in the wild. This has spurred an interest in devising highly customized AI foundational models that are tailored specifically to the mental health domain, see my discussion at the link here.
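To give a rough sense of what such a technological guardrail might look like, consider the hypothetical Python sketch below. The function names, keyword list, and escalation logic are simplifications assumed purely for illustration; they do not depict the design of Therabot or any other actual mental health chatbot.

```python
# Hypothetical guardrail wrapper around an LLM used for mental health chat.
# All names and checks here are illustrative placeholders.

CRISIS_KEYWORDS = ["suicide", "self-harm", "hurt myself"]

def call_llm(prompt):
    # Placeholder for a call to a fine-tuned generative AI model.
    return "I'm sorry you're feeling this way. Can you tell me more?"

def notify_clinician(message):
    # Placeholder: alert an on-call mental health professional (human oversight).
    print(f"[ALERT] Clinician review requested for: {message!r}")

def violates_policy(text):
    # Placeholder safety screen; a real system might use a trained classifier.
    return "diagnosis" in text.lower()

def guarded_reply(user_message):
    # Guardrail 1: escalate to a human if the user signals a possible crisis.
    lowered = user_message.lower()
    if any(keyword in lowered for keyword in CRISIS_KEYWORDS):
        notify_clinician(user_message)
        return ("It sounds like you may be in distress. A clinician has been "
                "notified and will reach out to you.")

    # Guardrail 2: screen the AI's draft reply before it reaches the user.
    draft = call_llm(user_message)
    if violates_policy(draft):
        return "I'm not able to help with that, but I'd like to hear how you're doing."
    return draft
```

Even a wrapper like this leaves open questions, such as what the keyword list misses and how quickly a human can actually respond, which is part of why human oversight remains essential.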
Thoughts On AI Research In This Realm
Let’s shift gears and consider the myriad of research pursuits from a 30,000-foot level. We can garner useful insights into how such research has been conducted and how it might best be conducted going forward.
Here are five notable considerations that are worthwhile contemplating:
- (1) Developers and researchers are at times intertwined. Studies in this realm often entail the researchers also being the developers of the AI-based mental health app. The good news is that by being both the developer and the researcher, they know the app in great detail and can potentially hone the system accordingly. The downside is a potential conflict of interest, since there can be a subconscious propensity to want the AI to be a success. It will be helpful if researchers who have no connection whatsoever to a given AI system pick up the mantle and perform distinctly independent third-party studies.
- (2) Comparison of AI to nothing else is a bit soft. Studies in this realm will typically devise an experimental setting that consists of the AI app as the treatment and then have a control group that doesn’t make use of the AI. That’s fine. It is useful to see whether the AI appears to make a difference in mental health outcomes. But here’s the rub. An argument can be made that just about anything might produce upside mental health outcomes. In that sense, the AI is almost like a placebo (so the argument goes). To deal with this possibility, simultaneously running a group that is, for example, encouraged to access online information about mental health can provide a handy comparison basis. How much of an uplift, if any, does the AI app have over simple access to readily available information about mental health? That would be a more demonstrative kind of difference.
- (3) Amount of outside handling. When the AI app falters, or perhaps detects worrisome concerns based on what the user has expressed, such as possible self-harm, these experiments typically and rightfully have a process in which humans are notified, such as a mental health professional standing by. Those outside touchpoints are not usually given much attention by the media reporting on these studies. That’s unfortunate. The implication that the media tends to portray is that the AI was somehow fully autonomous. Knowing how much outside handling occurred can be a bellwether for what such AI might need in real-world usage.
- (4) Degree of tailoring of the AI. Another important aspect to consider is how much a generative AI or LLM has been tailored to perform mental health advisement. Is it a minor adjustment or a major overhaul? Again, the media at times runs with the notion that generic generative AI has been substantiated by a given experiment, but that could be a misleading or false indication. The AI used in the study might no longer resemble generic generative AI. Thus, comparing the tailored version to what people tend to use every day can be an entirely apples-to-oranges comparison.
- (5) Subjects versed in generative AI. A tough issue that will increasingly need to be coped with is that subjects in this realm are likely to have already used generative AI, perhaps daily. The question arises as to whether this makes them more prone to use the tailored AI app than those who do not avidly use generative AI. It will be important for studies to ask subjects about their prior and existing use of generative AI. This will help to clarify what impact such usage has on those in the treatment group and those in the control group. Indeed, if the control group is not using the tailored AI app, they might be quietly using generic generative AI during the course of the experiment anyway. That’s worth knowing.
Big Picture Of The Biggest Kind
The research studies in this realm that aim to be highly methodical and systematic will typically make use of the longstanding, time-tested practice of the RCT (randomized controlled trial). This consists of devising an experimental design that randomly assigns some subjects to a treatment or experimental group and other subjects to a control group. Such a rigorous approach aims to prevent confounding factors from getting in the way of making suitable claims about what the research identified and can soundly assert as outcomes.
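For readers unfamiliar with the mechanics, the random assignment at the heart of an RCT is conceptually straightforward. Here is a minimal, illustrative Python sketch; the participant roster, seed, and even split are made up for the example and are not meant to replicate the procedures of the study discussed above.

```python
import random

# Illustrative random assignment for an RCT-style design (made-up roster).
participants = [f"subject_{i:03d}" for i in range(1, 211)]  # 210 hypothetical subjects

rng = random.Random(42)   # fixed seed so the assignment can be reproduced
rng.shuffle(participants)

midpoint = len(participants) // 2
treatment_group = participants[:midpoint]   # would receive the AI intervention
control_group = participants[midpoint:]     # waitlist control, no app access

print(len(treatment_group), len(control_group))  # 105 105
```

The hard part is not the assignment itself but the recruitment, monitoring, and follow-up that a rigorous trial demands, which brings us to the tradeoffs below.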
First, let’s give suitably due credit to those studies using RCT.
The amount of time and energy required can be substantial. Unlike other research approaches that are more ad hoc, trying to do things the best possible way can be time-consuming and costly. A willingness and persistence to do AI-related research in this manner is exceedingly laudable. Thanks for doing so.
An issue or challenge with RCTs is that since they tend to take a longer time to conduct, the time lag can detract from the results. This is especially the case in high-tech fields, including AI. Advances in AI are happening very quickly, on the order of days, weeks, and months. Meanwhile, some of these RCT studies can take a year or two to undertake and complete, along with writing up the study and getting it published.
Some would say that these studies often have a rear-view mirror perspective, talking about the past rather than the present or the future. The RCT is somewhat immersed in a brouhaha right now. It is the gold standard, and top-notch scientific work relies on proceeding with it. But does the time lag tend to produce results that are outdated or no longer relevant?
In a provocative article entitled “Fixing The Science Of Digital Technology Harms” by Amy Orben and J. Nathan Matias, Science, April 11, 2025, the authors make these intriguing assertions:
- “When scientists study new technologies, they naturally reach for ‘routine science’ – slow, careful work aimed at minimizing errors that have led to vast amounts of societal progress over generations. Yet the rate of change in digital technology is now outpacing scientific knowledge creation to a degree that it is becoming unusable.”
- “The slow practices of ‘routine science’ could arguably be failing to prevent loss of livelihood, health, and lives.”
The problem that we seem to be faced with is an agonizing choice of longer research that has strong credibility, in contrast to less rigorous approaches that can be undertaken faster but that would then be dicey when seeking to embrace the stated results. I certainly don’t want newbie researchers in AI to think that this hotly debated issue gives them a free ride to ditch RCT.
Nope.
That’s not the answer.
We need solid research in the field of AI for mental health. Period, end of story. The more, the better. Meanwhile, perhaps we can find some middle ground and be able to have our cake and eat it too. Stay tuned as I cover this mind-bending puzzle in-depth in an upcoming posting.
A final thought for now comes from the legendary Marcus Aurelius, famously having made this telling remark alluding to the vital nature of research: “Nothing has such power to broaden the mind as the ability to investigate systematically and truly all that comes under thy observation in life.”
Let’s fully embrace that credo.