Transposable elements (TEs) represent approximately 40% of the human genome and are increasingly recognized as dynamic regulators of genomic architecture and gene expression. Despite their prevalence, the role of TEs in complex disease aetiology remains underexplored. To our knowledge, this is the first study to investigate the contribution of TEs to cardiovascular disease (CVD) risk.
Using Illumina whole-genome sequencing (WGS) data and two TE detection tools, we identified and characterized TE variants in 940 Busselton Health Study participants. For 768 participants, CVD cases were defined using ICD codes from linked morbidity and mortality data from the WA Department of Health. We applied logistic PCA, UMAP visualization and a clustering algorithm to the high-dimensional TE variant data and to specific CVD outcomes to assess whether the distribution of TE variants is associated with CVD.
We identified 18,516 non-reference retrotransposons, consisting of 12,616 Alus, 4,954 LINE1s, and 946 SINE-VNTR-Alus (SVAs), across all 940 whole genomes. Variants with an insertion frequency above 0.01 (5,047 TEs) were used in logistic PCA and K-means clustering, which separated CVD outcomes into three partially overlapping clusters. These three clusters broadly corresponded to acute myocardial infarction, atrial fibrillation, and a control group. A chi-square test with Monte Carlo simulation (10,000 replicates) showed a significant association between these clusters and the CVD outcomes (p < 0.001). A Random Forest classifier was then used to identify the TEs contributing most to the cluster labels. Based on mean aggregated feature importance scores above 0.5, 14 TE variants (13 Alus and 1 LINE1) were identified as key contributors to cluster separation. This approach offers insights into the significance of TE variants in complex diseases such as CVD and highlights their potential as genetic markers influencing complex disease risk.
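The chi-square test with Monte Carlo simulation mentioned above can be illustrated with a minimal sketch. The abstract does not state which software or sampling scheme was used, so this is an assumption: the null distribution is approximated here by permuting outcome labels, which keeps the row and column totals of the cluster-by-outcome table fixed. The function names and the `clusters` and `outcomes` arrays are hypothetical and are not taken from the study.

```python
import numpy as np


def chi2_statistic(table):
    """Pearson chi-square statistic for a contingency table."""
    table = np.asarray(table, dtype=float)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return float(((table - expected) ** 2 / expected).sum())


def contingency(rows, cols, n_rows, n_cols):
    """Count co-occurrences of integer-coded row and column labels."""
    table = np.zeros((n_rows, n_cols), dtype=int)
    np.add.at(table, (rows, cols), 1)
    return table


def monte_carlo_chi2(clusters, outcomes, n_replicates=10_000, seed=0):
    """Monte Carlo p-value for the association between two label vectors.

    Assumption: label permutation is used to sample tables with the
    observed marginals; the study itself may have used another scheme.
    """
    rng = np.random.default_rng(seed)
    # Recode arbitrary labels as integers 0..k-1.
    _, r = np.unique(np.asarray(clusters).ravel(), return_inverse=True)
    _, c = np.unique(np.asarray(outcomes).ravel(), return_inverse=True)
    n_r, n_c = int(r.max()) + 1, int(c.max()) + 1

    observed = chi2_statistic(contingency(r, c, n_r, n_c))

    exceed = 0
    for _ in range(n_replicates):
        permuted = rng.permutation(c)  # breaks any cluster-outcome link
        simulated = chi2_statistic(contingency(r, permuted, n_r, n_c))
        # Small tolerance guards against floating-point ties.
        if simulated >= observed - 1e-9:
            exceed += 1

    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (exceed + 1) / (n_replicates + 1)
    return observed, p_value
```

With 10,000 replicates the smallest p-value this estimator can return is 1/10,001, so a result reported as p < 0.001 means that very few, if any, of the simulated tables were as extreme as the observed one.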