I realized that I have a combinatorics thread running
through my blog across several subjects.
I have been interested in combinatorics since I sent an email to Robert
Spitzer on the various combinations of diagnostic criteria. His only comment was “Interesting”. Since then, I have commented on a post that
purported to discredit psychiatric diagnoses based on the number of possible combinations of diagnostic criteria (too many), a study of the real combinations of major depression diagnoses, and character and word combinations for encryption and
password protection. I went as far as
getting dice and using them to construct passphrases of varying length using
the Electronic Frontier Foundation (EFF) word list for that purpose.
If you have no experience with combinations or it has been a
long time since your college statistics course, dice are a good place to
start. Each die has 6 sides numbered 1 through 6. The total number of possible outcomes is 6^n, where
n = the number of dice rolled at once. The
EFF word list is 7,776 words long and that happens to be 6^5. So, to generate passphrases, 5 dice are
rolled, the resulting five-digit number is looked up on the word list, and the matching word is
recorded. The process is repeated until
the desired phrase length is generated.
The only downside to this method is that some sites still insist on
additional numbers and special characters.
They can still be inserted in the passphrase, but other systems like
hexadecimal may be more convenient. The
advantage to passphrases is that they are theoretically easier to memorize and
type without error. That breaks down
with very long phrases.
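
The dice method can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `secrets` module stands in for physical dice, and `placeholder_wordlist` is a made-up stand-in for the real EFF list (which maps each five-digit roll to an actual word). The function names are mine, not part of any EFF tool.

```python
import math
import secrets
from itertools import product

SIDES = 6                            # faces on a standard die
DICE_PER_WORD = 5                    # five dice select one word
LIST_SIZE = SIDES ** DICE_PER_WORD   # 6^5 = 7,776 possible rolls

def roll_key(n_dice=DICE_PER_WORD):
    """Simulate rolling n_dice dice; return the digits as a string such as '26413'."""
    return "".join(str(secrets.randbelow(SIDES) + 1) for _ in range(n_dice))

def make_passphrase(wordlist, n_words):
    """Pick one word per simulated roll; wordlist maps five-digit keys to words."""
    return " ".join(wordlist[roll_key()] for _ in range(n_words))

def entropy_bits(n_words, list_size=LIST_SIZE):
    """Entropy in bits when each word is drawn uniformly and independently."""
    return n_words * math.log2(list_size)

# Hypothetical stand-in for the EFF list: one fake "word" per possible roll.
placeholder_wordlist = {
    "".join(key): "word" + "".join(key) for key in product("123456", repeat=DICE_PER_WORD)
}

print(make_passphrase(placeholder_wordlist, 6))   # six placeholder words, random each run
print(round(entropy_bits(6), 1))                  # 77.5
```

Each added word contributes about 12.9 bits, which is why longer phrases are stronger but harder to remember and type.
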
In biology and medicine, combinatorics can be applied at several
levels. Some have more meaning than others.
On this blog, I responded to a paper suggesting that the possible
combinations of diagnostic criteria meant that psychiatric diagnoses were meaningless
and unscientific. The lesson from
that post is to have an idea of what you are counting and what it means. The
total number of combinations of verbal criteria depends a lot on the phrasing, and the
total number of criteria, whether large or small, is not necessarily
disqualifying, as illustrated in that post.
The combinatorial upper limit can be unrealistically large based on how
it is defined, and just running the numbers does not mean that all possible
combinations will be found. There also seems
to be some magical thinking involved: counting something does
not by itself say anything about what the count means.
It is quite literally an exercise in “the map
is not the territory.”
I looked at a second paper where the authors looked at a
lower number of combinations based on the DSM diagnostic criteria for major
depression. In that case the total
number of possible combinations was much lower at 227. The authors of that second paper did
standardized interviews on 3,800 people and of the 1,566 with major depression,
just 10 of those combinations accounted for 50% of the cases. About ¼ of the possible combinations (57/227)
did not occur in any group. This paper
is a stark reminder that just counting things in biology or medicine doesn’t
necessarily mean anything.
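
The figure of 227 can be reproduced by brute-force counting. The sketch below assumes the usual DSM rule for major depression: 9 symptom criteria, at least 5 present, and at least one of the two core symptoms (depressed mood or loss of interest). The symptom numbering is arbitrary and the names are mine.

```python
from itertools import combinations

N_CRITERIA = 9     # DSM symptom criteria for major depression
MIN_SYMPTOMS = 5   # at least five must be present
CORE = {0, 1}      # assumed indices: depressed mood, loss of interest

def qualifying_profiles():
    """List every symptom combination that meets the diagnostic threshold."""
    profiles = []
    for k in range(MIN_SYMPTOMS, N_CRITERIA + 1):
        for combo in combinations(range(N_CRITERIA), k):
            if CORE & set(combo):          # at least one core symptom present
                profiles.append(combo)
    return profiles

profiles = qualifying_profiles()
print(len(profiles))                   # 227
print(round(57 / len(profiles), 2))    # 0.25, the share never observed
```

The count is only the theoretical ceiling. The interview data show that real cases concentrate in a small part of it.
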
That brings me to the concept of how we make sense out of
the most valid combinatorial explosions in medicine. For me, validity is
baked into the biology and not into a verbal description of things. The backing for that comes from biological
taxonomy and the fact that molecular biology and genomics are solving problems
that could not be solved by the verbal description of direct observations in
the Linnaean tradition. To that end I am
reproducing a table below that is all about the polygenic risk for bipolar disorder.
Note that in this table the authors are estimating the total
possible combinations of 803 polygenes. The theoretical number of possible
combinations can be calculated using the formula n! / (r!(n − r)!), where
n represents the number of genetic variants analyzed in a study, and
r represents the number of genetic variants per combination. In the case of
SNP genotypes, each SNP can occur as one of 3 genotypes, so the formula is n! / (r!(n − r)!) × 3^r. The authors point out that the lowest value for r is 2 but the upper
limit is unknown. They also show how the
number of combinations can be limited experimentally. The 57,911,211
combinations found only in patients and not controls could all be random,
but there were a significant number of SNPs associated with different groupings
in bipolar disorder.
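Using the equations from above in a more readable
form:

$$\binom{n}{r} = \frac{n!}{r!\,(n-r)!}$$

$$\binom{n}{r} \times 3^{r} = \frac{n!}{r!\,(n-r)!} \times 3^{r}$$

Substitution yields the following: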
- from the top equation, for 100 variants the theoretical
10-variant combinations would be 1.73 × 10^13
- from the bottom equation, for 500,000 SNPs analyzed there
would be about 1.1 × 10^12 two-variant combinations and 5.6 × 10^17
three-variant combinations.
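
Both formulas are easy to check with Python's standard library. This is a small sketch with function names of my own choosing. The factor of 3 assumes each SNP has three possible genotypes.

```python
from math import comb

def variant_combinations(n, r):
    """n! / (r!(n - r)!): ways to choose r variants out of n."""
    return comb(n, r)

def snp_genotype_combinations(n, r, genotypes_per_snp=3):
    """Same count multiplied by 3^r, since each chosen SNP has 3 possible genotypes."""
    return comb(n, r) * genotypes_per_snp ** r

print(f"{variant_combinations(100, 10):.2e}")          # 1.73e+13
print(f"{snp_genotype_combinations(500_000, 2):.2e}")  # two-SNP genotype combinations
print(f"{snp_genotype_combinations(500_000, 3):.2e}")  # three-SNP genotype combinations
```

Because `comb` works with exact integers, there is no rounding until the result is formatted for display.
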
Practical measures for limiting the search include scanning SNPs
for combinations of varying length in the population of interest relative to
controls. Combinations identified at shorter lengths can be taken out before scanning for
longer combinations. A further simplification is to scan only for combinations
found in patient populations. An example
of that kind of study is included in the tables below for 803 SNPs in 607 bipolar
disorder patients and 1,354
controls.
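
The idea of scanning only for combinations found in patients and never in controls can be illustrated with a toy example. This is not the method used in the cited studies. The data are invented, the genotype coding (0, 1, 2) is an assumption, and the function names are hypothetical.

```python
from itertools import combinations

def genotype_combinations(person, r):
    """All r-sized combinations of (SNP, genotype) pairs carried by one person."""
    return set(combinations(sorted(person.items()), r))

def patient_only_combinations(patients, controls, r):
    """Count combinations seen in at least one patient and in no control."""
    seen_in_controls = set()
    for person in controls:
        seen_in_controls |= genotype_combinations(person, r)
    counts = {}
    for person in patients:
        for combo in genotype_combinations(person, r) - seen_in_controls:
            counts[combo] = counts.get(combo, 0) + 1
    return counts

# Invented toy data: each person maps SNP name -> genotype code (0, 1, or 2).
patients = [
    {"snp1": 0, "snp2": 2, "snp3": 1},
    {"snp1": 0, "snp2": 2, "snp3": 0},
]
controls = [
    {"snp1": 0, "snp2": 1, "snp3": 1},
    {"snp1": 1, "snp2": 2, "snp3": 0},
]

counts = patient_only_combinations(patients, controls, r=2)
for combo, n in sorted(counts.items()):
    print(combo, n)
# (('snp1', 0), ('snp2', 2)) 2
# (('snp1', 0), ('snp3', 0)) 1
# (('snp2', 2), ('snp3', 1)) 1
```

With real data the search space grows far too quickly for this naive enumeration. That is why reducing the number of combinations experimentally matters.
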
Cluster and subgroup analysis is required in very
heterogeneous conditions to analyze clusters containing a specific SNP, the
distribution of SNP genotypes relative to controls, and cluster selection that
contains an SNP for a specific biological function. Using this kind of analysis, 73/609 bipolar
disorder patients had these clusters compared to none in the control
population.
While the SNP and variant analysis in 2017 is a good example
of combinatorial applications, it did not address the problem of missing
heritability. Missing heritability is
the difference between what is observed in familial heritability studies and
what is predicted with genetic analysis.
With SNP-based analysis, only a low percentage
of familial inheritance was predicted.
That improved with more sensitive analytical techniques that considered
additional genetic mechanisms. The
additional mechanisms included SNVs (single nucleotide variations), insertions or
deletions (indels), SVs (structural variations), CNVs (copy number variations),
and STRs (short tandem repeats) (3-5). Applications
that identify all these variations are much more likely to predict the
heritability of the pedigree than earlier techniques. I hope to revisit some of these genetic
innovations in an upcoming post about the DSM-6 proposals.
George Dawson, MD, DFAPA
References:
1: Mellerup E, Møller
GL. Combinations of Genetic Variants Occurring Exclusively in Patients. Comput
Struct Biotechnol J. 2017 Mar 10;15:286-289. doi: 10.1016/j.csbj.2017.03.001.
PMID: 28377798; PMCID: PMC5367802.
2: Koefoed P,
Andreassen OA, Bennike B, Dam H, Djurovic S, Hansen T, Jorgensen MB, Kessing
LV, Melle I, Møller GL, Mors O, Werge T, Mellerup E. Combinations of SNPs
related to signal transduction in bipolar disorder. PLoS One. 2011;6(8):e23812.
doi: 10.1371/journal.pone.0023812. Epub 2011 Aug 29. PMID: 21897858; PMCID:
PMC3163586.
3: Behera S, Catreux
S, Rossi M, Truong S, Huang Z, Ruehle M, Visvanath A, Parnaby G, Roddey C,
Onuchic V, Finocchio A, Cameron DL, English A, Mehtalia S, Han J, Mehio R,
Sedlazeck FJ. Comprehensive genome analysis and variant detection at scale
using DRAGEN. Nat Biotechnol. 2025 Jul;43(7):1177-1191. doi:
10.1038/s41587-024-02382-1. Epub 2024 Oct 25. PMID: 39455800; PMCID:
PMC12022141.
4: Wainschtein P,
Zhang Y, Schwartzentruber J, Kassam I, Sidorenko J, Fiziev PP, Wang H, McRae J,
Border R, Zaitlen N, Sankararaman S, Goddard ME, Zeng J, Visscher PM, Farh KK,
Yengo L. Estimation and mapping of the missing heritability of human
phenotypes. Nature. 2026 Jan;649(8099):1219-1227. doi:
10.1038/s41586-025-09720-6. Epub 2025 Nov 12. PMID: 41225014; PMCID:
PMC12851931.
5: Grotzinger AD,
Werme J, Peyrot WJ, Frei O, de Leeuw C, Bicks LK, Guo Q, Margolis MP, Coombes
BJ, Batzler A, Pazdernik V, Biernacka JM, Andreassen OA, Anttila V, Børglum AD,
Breen G, Cai N, Demontis D, Edenberg HJ, Faraone SV, Franke B, Gandal MJ,
Gelernter J, Hatoum AS, Hettema JM, Johnson EC, Jonas KG, Knowles JA, Koenen
KC, Maihofer AX, Mallard TT, Mattheisen M, Mitchell KS, Neale BM, Nievergelt
CM, Nurnberger JI, O'Connell KS, Peterson RE, Robinson EB, Sanchez-Roige SS,
Santangelo SL, Scharf JM, Stefansson H, Stefansson K, Stein MB, Strom NI,
Thornton LM, Tucker-Drob EM, Verhulst B, Waldman ID, Walters GB, Wray NR, Yu D;
Anxiety Disorders Working Group of the Psychiatric Genomics Consortium;
Attention-Deficit/Hyperactivity Disorder (ADHD) Working Group of the
Psychiatric Genomics Consortium; Autism Spectrum Disorders Working Group of the
Psychiatric Genomics Consortium; Bipolar Disorder Working Group of the
Psychiatric Genomics Consortium; Eating Disorders Working Group of the
Psychiatric Genomics Consortium; Major Depressive Disorder Working Group of the
Psychiatric Genomics Consortium; Nicotine Dependence GenOmics (iNDiGO)
Consortium; Obsessive-Compulsive Disorder and Tourette Syndrome Working Group
of the Psychiatric Genomics Consortium; Post-Traumatic Stress Disorder Working
Group of the Psychiatric Genomics Consortium; Schizophrenia Working Group of
the Psychiatric Genomics Consortium; Substance Use Disorders Working Group of
the Psychiatric Genomics Consortium; Lee PH, Kendler KS, Smoller JW. Mapping
the genetic landscape across 14 psychiatric disorders. Nature. 2026
Jan;649(8096):406-415. doi: 10.1038/s41586-025-09820-3. Epub 2025 Dec 10. PMID:
41372416; PMCID: PMC12779569.


