4th School on Data Science and Machine Learning
November 16 – 21, 2025
ICTP-SAIFR, São Paulo, Brazil
Venue: ICTP-SAIFR/NCC-UNESP
Home
The 4th School on Data Science and Machine Learning invites ambitious researchers ready to harness the transformative power of advanced artificial intelligence. Building on our successful four-year legacy, this year’s program features state-of-the-art topics reflecting the rapid evolution of AI in 2025. Machine Learning is revolutionizing every sector of society — from breakthrough medical diagnostics to intelligent systems supporting vulnerable populations to innovative public safety solutions. These advancements aren’t just technological achievements; they are catalysts for new public policies and social frameworks. In this context, equipping researchers with advanced ML knowledge is crucial for the field’s continued development and responsible application.
Target Audience and Learning Approach
Our program is tailored for advanced PhD candidates finalizing research, early-career postdoctoral researchers, professionals seeking to integrate cutting-edge AI into their work, and researchers from diverse backgrounds looking to apply AI to their disciplines. We have designed a balanced learning approach with morning theoretical sessions paired with afternoon hands-on practical exercises that bridge theory and application.
What Makes This School Unique
Our forward-looking curriculum covers foundation models, advanced generative AI, efficient scaling, and frontier applications. We proudly feature leading Brazilian AI researchers alongside international pioneers, creating a rich environment for knowledge exchange. Participants will not just learn algorithms — they will develop critical thinking around AI applications, ethical considerations, and domain-specific implementations.
Networking and Peer Learning
To foster collaboration and peer learning, all participants are required to bring a research poster to be presented during coffee breaks and lunch intervals. This is an opportunity to showcase your work, receive feedback, and engage in meaningful discussions with fellow participants and instructors. The poster should include a concise description of your current research. If machine learning is already part of your work, the poster should explain how it is being used and, if available, present preliminary or final results. These sessions are designed to spark new collaborations and provide insights that can directly benefit your research.
Recommended background
- Early-career researchers (e.g., PhD students, postdocs, or final-year Master’s students)
- Fields: natural sciences, engineering, computer science, mathematics, social sciences, or humanities with interest in AI/ML
- Basic programming knowledge in Python
- Familiarity with basic statistics
- No prior machine learning experience required, but interest and motivation are essential
- Proficiency in English (school will be held in English)
Universal Scientific Education and Research Network (USERN) Congress
The 10th Universal Scientific Education and Research Network (USERN) Congress will be hosted by PUC-Campinas from November 8-10, 2025, just before our school. This prestigious international event offers young scientists an exceptional opportunity to present their research through USERN Junior Talks or poster presentations, with all poster participants receiving dedicated 3-minute presentation slots. The congress features exciting competition elements, including Best Talk and Best Poster contests, with top-performing junior researchers earning the prestigious opportunity to present their abstracts as featured “USERN Junior Talks.” Registration is now open with very affordable fees. For details and program information, visit usern.org.
Organizers:
- Raphael Cobe (NCC-UNESP/AI2, Brazil)
- Tommaso Dorigo (INFN-Padova, Italy)
- Sergio F. Novaes (NCC-UNESP/AI2, Brazil)
- Thiago Tomei (NCC-UNESP/AI2, Brazil)
There is no registration fee and limited funds are available for travel and local expenses.
Announcement:
Click HERE for online application
Application deadline: September 16, 2025
Lecturers
- Alexandre Simões (Unesp) – Deep Learning
- Ana Carolina Lorena (ITA) – Explainable AI
- Carolina Gonzalez (Instituto Santos Dumont) – AI in Life Sciences
- Fabio Ortega (USP) – AI for the Climate Emergency
- Rafael Zanatta (USP and Data Privacy Brasil) – Ethics in AI
- Raphael Cobe (NCC-UNESP/AI2, Brazil) – Introduction to Machine Learning
- Renato Vicente (USP) – Introduction to Neural Networks
- Sérgio Novaes (NCC-UNESP/AI2, Brazil) – AI and the University
- Felippe Alves (CIAAM – USP) – AI Benchmarks
- Davi Bastos Costa (CIAAM – USP) – Multi-agent LLM Systems
Registration
Participants
Posters
Tuesday – Nov 18
- Alves Beneti, Gustavo (Universidade Federal do ABC, Brazil): Evaluation of the Thermal Stability of Organic Mixed Conducting Polymers
Organic conductive polymers revolutionized the exploration of electrical conductivity, establishing a paradigm shift in materials science at the end of the 20th century. Organic mixed ionic-electronic conductors (OMIEC) have proven to be promising for the future development of electronics. This study utilized emerging large language models (LLMs) to identify the key materials mentioned in scientific publications in this field, highlighting two standout polymers: poly(3,4-ethylenedioxythiophene) (PEDOT) and poly(3-hexylthiophene) (P3HT). Additionally, molecular dynamics simulations with the reactive ReaxFF potential were performed to assess the thermal stability of the identified polymers. Moreover, the temporal evolution of the number of molecules and species, as well as the components of the potential energy, was analyzed in the proposed discussions. For PEDOT, it was identified that degradation begins with the breaking of the backbone chain, which directly affects the thiophene rings. In contrast, P3HT, which has side chains, exhibited degradation mechanisms starting with the formation of radicals resulting from the breaking of its side chains. The results contribute to the understanding of the thermal degradation mechanisms of these materials, paving the way for advances in the design of more stable polymers for electronic device applications. Finally, a philosophical investigation regarding science was also conducted, which allowed for the understanding of biases underlying each methodological decision made, as well as an evaluation of the epistemological validity of the computational simulation performed, which demonstrates that such simulations complement and are equivalent to traditional experimental research, without disregarding the dynamism and mutability in scientific methodologies.
- Arneiro, Jhoao Gabriel Martins Campos De Almeida (Instituto de Física da Universidade de São Paulo, Brazil): Deep Learning Methods for Jet Tagging and Process Classification Using Image Processing
The use of neural networks in high-energy physics has rapidly expanded, particularly in jet tagging applications. This study explores a convolutional neural network (CNN) based approach to classify jets produced in high-energy collisions by differentiating between heavy quark (charm, bottom), light quark (up, down, strange), and gluon jets without relying on jet reconstruction. The method constructs image-like representations based on the kinematics of charged decay products using detector-level variables, which allow CNNs to identify visual patterns characteristic of each jet type. This approach demonstrates strong classification performance, highlighting the versatility of CNN architectures in jet tagging and advancing our understanding of jet substructures.
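As a toy illustration of the image-construction step described above (not the authors' code), charged decay products can be pixelated into a pT-weighted (η, φ) grid with NumPy; the bin count, window size, and random toy jet are assumptions for the sketch:

```python
import numpy as np

def jet_image(eta, phi, pt, bins=32, half_width=0.4):
    """Pixelate charged decay products into a pT-weighted (eta, phi) grid.

    eta and phi are taken relative to the jet axis; half_width sets the
    angular window around the axis covered by the image.
    """
    img, _, _ = np.histogram2d(
        eta, phi, bins=bins,
        range=[[-half_width, half_width], [-half_width, half_width]],
        weights=pt,
    )
    # Normalize pixel intensities to sum to 1, a common preprocessing step
    total = img.sum()
    return img / total if total > 0 else img

# Toy jet: 50 charged constituents clustered around the jet axis
rng = np.random.default_rng(0)
img = jet_image(rng.normal(0, 0.1, 50), rng.normal(0, 0.1, 50),
                rng.uniform(1, 10, 50))
```

A stack of such images, one per jet, would then be the input tensor for a CNN classifier.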
- Arosquipa Yanque, Nury Yuleny (University of São Paulo, Brazil): GLOF-Vulnerability Identification based on Automatic Method for Mapping Glacial Lakes in Cordillera Blanca (Peru)
Glacial lake outburst floods (GLOFs) pose a significant threat to high-mountain communities and infrastructure, particularly in glacierized regions such as the Cordillera Blanca, Peru. This study presents an automated method for mapping glacial lakes using Sentinel-2 satellite imagery with enhanced band combinations and a segmentation-based foundational model (SAM 2.1). Our method enables consistent multitemporal lake extraction, which successfully mapped 80% of 448 manually identified glacial lakes. We performed a comparative analysis of images taken in May 2016 and May 2024, and identified five lakes potentially vulnerable to GLOFs. Notably, Lake Paron and Lake Piticocha experienced significant surface area expansion of 13.79 and 3.72 hectares, respectively. The other three lakes also expanded significantly despite their smaller sizes. GLOF simulations incorporating local topography and lake size indicate potential impacts on urban areas (e.g., Caraz city) and agricultural land. These findings highlight the importance of standardized and automated glacial lake monitoring for early risk assessment and disaster preparedness in the face of climate-driven glacial change.
- Belem Ribeiro, Crizan (Cruzeiro do Sul, Brazil): Statistical Physics and Machine Learning: Data-Driven Approaches to Complex Systems
This poster presents the use of methods from Statistical Physics combined with Machine Learning to analyze and model complex systems. Using Python (NumPy, Pandas, Scikit-learn, PyTorch), we explore how probabilistic models, Monte Carlo simulations, and neural networks can be applied to study phase transitions, critical phenomena, and emergent behaviors in high-dimensional datasets.
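A minimal example of the kind of Monte Carlo simulation of phase transitions mentioned above: a Metropolis sweep for the 2D Ising model in NumPy. The lattice size, temperature, and sweep count are illustrative choices, not taken from the poster:

```python
import numpy as np

def metropolis_sweep(spins, beta, rng):
    """One Metropolis sweep of the 2D Ising model (periodic boundaries)."""
    L = spins.shape[0]
    for _ in range(L * L):
        i, j = rng.integers(0, L, size=2)
        # Energy change of flipping spin (i, j): dE = 2 s_ij * sum(neighbors)
        nb = (spins[(i + 1) % L, j] + spins[(i - 1) % L, j]
              + spins[i, (j + 1) % L] + spins[i, (j - 1) % L])
        dE = 2 * spins[i, j] * nb
        # Accept the flip if it lowers the energy, else with Boltzmann probability
        if dE <= 0 or rng.random() < np.exp(-beta * dE):
            spins[i, j] *= -1
    return spins

rng = np.random.default_rng(0)
spins = rng.choice([-1, 1], size=(16, 16))
for _ in range(100):
    metropolis_sweep(spins, beta=1.0, rng=rng)  # beta above the critical ~0.44
magnetization = abs(spins.mean())
```

Sweeping beta across the critical point and recording the magnetization is the classic route to locating the phase transition; the resulting configurations can also serve as training data for the neural-network models the poster describes.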
- Benincá, Thalita Sartori (UFES, Brazil): Spectroscopic redshift of galaxies with artificial intelligence
This research aims to apply machine learning techniques to predict the spectroscopic redshift (spec-z) of galaxies using data from large astronomical surveys. Accurate redshift measurement is fundamental for determining the distance and velocity of galaxies, making it essential for understanding the large-scale structure of the universe. In the first stage of the project, the dataset construction and processing were carried out based on Sloan Digital Sky Survey (SDSS) data, particularly the BOSS and eBOSS programs. Following this preparatory phase, the current focus is on implementing and comparing two regression models: XGBoost, which uses decision trees, and deep neural networks, recognized for their ability to capture complex and non-linear relationships between variables. The methodology includes cross-validation of the models, as well as analysis of metrics such as mean absolute error (MAE) and root mean square error (RMSE), with the aim of identifying the approach with the best predictive performance. Expanding the dataset with information from the DESI survey will allow evaluation of the robustness and generalization of the proposed models in more diverse observational contexts. The results are expected to contribute to improving redshift estimates, as well as providing accessible tools for the scientific community. All source code and applied methodology will be made available in a public GitHub repository, promoting reproducibility and collaborative use. It is concluded that the application of modern artificial intelligence techniques in observational astrophysics offers a promising path to accelerate and refine analyses in cosmology.
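The model comparison described above can be sketched as follows. This is an illustrative stand-in, not the project's code: it uses scikit-learn's GradientBoostingRegressor in place of XGBoost and a small MLP in place of a deep network, on synthetic data standing in for survey photometry, with cross-validated MAE and RMSE:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Synthetic stand-in: five broadband magnitudes -> a redshift-like target
rng = np.random.default_rng(42)
X = rng.uniform(15, 25, size=(500, 5))
y = 0.1 * (X[:, 0] - X[:, 1]) + 0.05 * X[:, 2] / 25 + rng.normal(0, 0.01, 500)

models = {
    "boosted_trees": GradientBoostingRegressor(random_state=0),
    "neural_net": MLPRegressor(hidden_layer_sizes=(64, 64),
                               max_iter=500, random_state=0),
}
scores = {}
for name, model in models.items():
    # 5-fold cross-validated out-of-sample predictions
    pred = cross_val_predict(model, X, y, cv=5)
    scores[name] = (mean_absolute_error(y, pred),
                    np.sqrt(mean_squared_error(y, pred)))
```

On real survey data, feature scaling and hyperparameter search would precede such a comparison; the cross-validated metrics are what make the two model families directly comparable.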
- Bernardo, Thamires Alves Silva (Federal Fluminense University, Brazil): Monitoring water stress in coffee plants using stable isotopes (δ¹³C) and TinyML
Water stress is a major limitation to crop productivity. Understanding how plants respond to water limitation is essential for optimizing water-use efficiency and improving resilience to climate variability. The stable carbon isotope signature (δ¹³C) is widely used as a proxy for intrinsic water-use efficiency, as it integrates physiological responses related to stomatal conductance and carbon assimilation over time. Bulk-leaf δ¹³C captures longer-term stress patterns, while δ¹³C from leaf glucose reflects more recent environmental conditions. Prior studies have also reported correlations between δ¹³C and leaf temperature, reinforcing its reliability as a drought indicator. However, real-time monitoring of drought stress under field conditions remains a challenge, as isotopic analysis is not suitable for continuous application due to its cost, time demands, and labor-intensive procedures. In this context, TinyML has emerged as a promising tool for agriculture, enabling the detection of early signs of water stress through low-cost, embedded systems. These systems can continuously monitor variables such as leaf temperature, soil moisture, and environmental conditions, and use locally trained TinyML algorithms to detect both short-term stress responses and long-term drought adaptation. This study proposes the use of δ¹³C as a physiological reference to evaluate and validate TinyML models for drought stress detection in coffee plants. This approach improves model robustness and ensures that predictions align with actual physiological responses. While the current prototype is being developed under controlled greenhouse conditions, the framework is designed to scale to more complex cropping systems, such as coffee–banana intercropping, where plant interactions and microclimatic variability further complicate drought dynamics. Ultimately, the goal is to develop accessible, autonomous tools to support sustainable irrigation management in tropical agriculture.
- Brito Granado, Gabriel (IF-USP, Brazil): Automating Anisotropic SAXS Analysis: From Procedural Algorithms to Machine Learning Frameworks
The high volume of data generated in modern synchrotron facilities has created a bottleneck in Small-Angle X-ray Scattering (SAXS) data analysis, particularly for complex 2D anisotropic patterns [1]. To address this problem, we are developing automated methodologies to perform the analysis process, from initial data to a structural model. Our current progress is centered on a functional procedural algorithm that automates data reduction by processing 2D images, identifying regions of interest, and generating 1D intensity profiles. This initial advance enables the automated preparation of data for modeling to extract key structural parameters. The next phase of this work aims to develop this tool further with Machine Learning (ML) applications for predictive modeling [2]. The initial goal is to train Neural Networks (NN) on large synthetic datasets to perform classification of experimental 1D profiles [3]. We have a preliminary method to generate the large synthetic dataset required for training; the protocol involves the generation of 1D scattering profiles in reciprocal space and their corresponding pair-distribution functions in real space [4]. We plan to develop a subsequent regression model NN to estimate the fitting parameters from the scattering curve, making the analysis faster. Our main objective is to create a hybrid framework where ML models provide a fast and accurate prediction for the structural models and their parameters, which is then refined by a least-squares algorithm for fitting [5]. This aims to establish a robust pipeline that will accelerate the structural analysis of complex oriented systems, such as liquid crystals and hair fibers. References: [1] Craievich AF. Synchrotron radiation in Brazil. Past, present and future. Radiation Physics and Chemistry. 2020;167:108253. [2] Goodfellow I, Bengio Y, Courville A. Deep Learning. The MIT Press; 2016. [3] Molodenskiy DS, Svergun DI, Kikhney AG. Artificial neural networks for solution scattering data analysis. Structure. 2022;30:900–908. [4] Alves C, Pedersen JS, Oliveira CLP. Modelling of high-symmetry nanoscale particles by small-angle scattering. Journal of Applied Crystallography. 2014;47:84–94. [5] Oliveira CLP. Investigating Macromolecular Complexes in Solution by Small Angle X-Ray Scattering. In: Chandrasekaran DA, editor. Current Trends in X-Ray Crystallography. InTech; 2011. p. 367–92.
- Celestino Rocha, Gabriel Wendell (Universidade Federal do Rio Grande do Norte, Brazil): Topological Data Analysis for Gravitational Wave Detection under Low Signal-to-Noise Ratios
The reliable detection of gravitational waves (GWs) in noisy time-series data remains a central challenge in data-driven astrophysics. In this work, we explore Topological Data Analysis (TDA) as a feature-extraction framework for GW-like chirp signals embedded in Gaussian noise. We reconstruct time-delay embeddings of simulated waveforms, construct Vietoris–Rips filtrations, and extract persistence diagrams up to the first homology group. These diagrams are transformed into vectorized summaries, including persistence images, landscapes, and Betti curves, which are then used as input features for standard machine learning classifiers. We systematically evaluate detection performance across a range of signal-to-noise ratios (SNR ∈ [2,10]) and compare TDA-based methods against classical baselines (time-domain and spectral features) and shallow neural models. Our results show that persistence-based descriptors achieve competitive detection accuracy, maintain robustness under decreasing SNR, and provide interpretable insights into the topological structure of GW signals. These findings highlight TDA as a promising tool for enhancing the interpretability and reliability of gravitational-wave data analysis.
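One of the vectorized summaries named above, the Betti curve, is simple enough to sketch directly. This illustrative function (not the authors' implementation) counts, at each filtration value, the persistence intervals that are alive; the toy H1 diagram is an assumption for the example:

```python
import numpy as np

def betti_curve(diagram, grid):
    """Vectorize a persistence diagram as a Betti curve.

    diagram: array of (birth, death) pairs for one homology dimension.
    Returns, for each filtration value t in grid, the number of
    intervals alive at t (birth <= t < death).
    """
    diagram = np.asarray(diagram, dtype=float)
    births, deaths = diagram[:, 0], diagram[:, 1]
    return np.array([np.sum((births <= t) & (t < deaths)) for t in grid])

# Toy H1 diagram: one long-lived loop (signal-like) and two noise features
dgm = [(0.1, 0.9), (0.2, 0.25), (0.3, 0.32)]
curve = betti_curve(dgm, np.linspace(0, 1, 11))
```

The resulting fixed-length vector, unlike the raw diagram, can be fed straight into standard classifiers; persistence images and landscapes play the same role with smoother encodings.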
- Da Silva, Maria Karolaynne (UFRN/M2Bio, Brazil): Potential Inhibitors of Zika Virus Envelope Protein Through Molecular Docking and Dynamics Simulation Analyses
Zika virus (ZIKV) infection remains a global health threat with no approved antivirals or vaccines to date, creating an urgent need for therapeutics targeting ZIKV. The viral envelope (E) protein is critical for host cell entry and represents a validated target for antiviral intervention. Here, we aimed to identify natural flavonoid compounds capable of inhibiting the ZIKV E protein using a dual-phase in silico screening strategy. First, we performed density functional theory (DFT) calculations to optimize the structures of nine candidate flavonoids and obtain quantum chemical descriptors (electronic properties); we also evaluated their drug-likeness and ADMET profiles. Second, we conducted molecular docking of these optimized flavonoids to the E protein, followed by hybrid quantum mechanics/molecular mechanics (QM/MM) refinement and 100 ns molecular dynamics (MD) simulations with principal component analysis (PCA) and MM-PBSA binding free energy calculations to assess binding interactions and complex stability. Docking identified quercetin, pinocembrin, and naringenin as the top binders, with binding energies of –8.3, –8.1, and –8.0 kcal/mol, respectively. These lead flavonoids also exhibited favorable pharmacokinetic properties, including high predicted gastrointestinal absorption, efficient clearance, and minimal toxicity risk (no carcinogenic or organ-specific alerts). Notably, pinocembrin’s complex demonstrated the greatest stability throughout a 100 ns MD simulation, maintaining a tightly bound conformation. In conclusion, quercetin, pinocembrin, and naringenin emerge as promising ZIKV E protein inhibitors with robust target engagement and favorable drug-like profiles. Their significant translational potential as antiviral candidates warrants further in vitro and in vivo studies to confirm efficacy and safety.
- Da Silva, Robison José Santos (Universidade Federal do Paraná, Brazil): Recurrence plots in neural networks to estimate the honey percentage in a sample
This study focuses on estimating the percentage of honey in samples formed by mixing honey with syrup. Electrical impedance measurements were acquired using a transistor’s capacitive cell, where the sample served as the electrolytic medium. These measurements were then used to generate recurrence plots (RPs) corresponding to different mixture ratios. An RP is built by plotting a time series against itself and identifying points where two states are sufficiently close, based on a defined threshold. Recurrence plots are effective tools for analyzing nonlinear systems, revealing recurrent behavior and hidden structures in the data. Once the RP images were constructed, a Convolutional Neural Network (CNN) was used to classify the samples according to their composition. CNNs are particularly effective for image analysis, as they capture spatial correlations between adjacent pixels, enhancing pattern detection. The results demonstrated that combining RPs with CNNs yields promising outcomes for classification tasks. Additionally, for comparison, a Multi-Layer Perceptron (MLP) was trained on the raw numerical data. While its classification performance was inferior to the CNN, it performed reasonably well in regression, estimating intermediate mixture levels. Future developments will aim to improve the model by including quantitative features extracted from the RPs (such as determinism, laminarity, entropy, and recurrence rate) to construct a more accurate and generalizable classifier.
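The RP construction described above can be sketched in a few lines of NumPy; the threshold and toy signal here are illustrative stand-ins, not the study's measured impedance data:

```python
import numpy as np

def recurrence_plot(series, threshold):
    """Binary recurrence plot: R[i, j] = 1 when states i and j are
    closer than the threshold."""
    x = np.asarray(series, dtype=float)
    dist = np.abs(x[:, None] - x[None, :])   # pairwise state distances
    return (dist < threshold).astype(np.uint8)

# Toy stand-in for an impedance time series: a clean oscillation
t = np.linspace(0, 4 * np.pi, 200)
rp = recurrence_plot(np.sin(t), threshold=0.1)
```

The resulting binary image (symmetric, with a solid main diagonal) is exactly the kind of input the CNN classifier consumes; for multivariate states one would replace the scalar distance with a vector norm over delay-embedded states.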
- De Luna Oliveira, José Arthur (UFRN, Brazil): The role of Stone-Wales defects in the tribological properties of graphene bilayers
Individual graphene sheets exhibit high mechanical stiffness and tensile strength, with values of approximately 1 TPa and 100 GPa, respectively. However, multilayer structures are susceptible to interlaminar sliding due to the weak van der Waals interactions between the overlapping sheets, which results in a low friction coefficient. This characteristic limits the application of graphene in composites and other devices that require greater tensile strength. To overcome this limitation, a promising approach is the introduction of defects, such as Stone-Wales (SW) and single vacancies, to increase surface roughness and, consequently, the static friction coefficient. In this context, the present study investigates the dependence of the frictional force on the concentration and distribution of these defects through Molecular Dynamics (MD) simulations using the LAMMPS software. The computational model consists of two graphene sheets in contact, in which Stone-Wales defects and single vacancies are randomly distributed. One of the sheets is held fixed, while the other is pulled laterally to evaluate the friction force. The results indicate a significant increase in the friction force required to initiate sliding, which was found to be dependent on both the defect concentration and the relative alignment between the layers. Although the insertion of defects can degrade the intrinsic mechanical properties of an isolated graphene sheet, our findings suggest that, for multilayer structures, the controlled presence of defects can enhance the mechanical properties of the system as a whole by creating energy barriers that hinder the sliding between monolayers.
- De Souza, Nayane (Universidade Tecnológica Federal do Paraná, Brazil): Identifying Noncanonical Open Reading Frames with Automated Multiclassification
Noncanonical open reading frames (ncORFs), once considered noncoding, are now recognized as potential sources of functional peptides and regulators of canonical genes. Advances in deep sequencing have revealed their widespread translation and involvement in both physiological and pathological processes, and ncORFs have been identified in both prokaryotes and eukaryotes. Computational detection remains challenging because these ORFs are typically short (fewer than 100 codons), often lack sequence conservation, may initiate at nonAUG start codons, and can display unusual codon usage. This complexity cannot be addressed with tools designed for canonical coding regions, and existing specialized tools are limited by the quality of their positive and negative training datasets. Recently, a consortium proposed standardized experimental techniques, and new high quality datasets have become available. We compiled these datasets to examine diverse biological characteristics including ORF length, transcript of origin, flanking regions, and codon usage, and integrated all features into a multiclassification framework using BioAutoML, an automated machine learning platform. BioAutoML systematically evaluates multiple algorithms to identify the most effective classifiers by combining biological descriptors with statistical and mathematical features. This approach enables scalable analysis of rapidly expanding sequencing data and aims to accelerate the discovery of novel functional elements, deepening our understanding of these noncanonical genomic regions.
- Del Rey, Ângelo Orletti (Universidade Federal de São Paulo, Brazil): Corticostriatal hypoconnectivity in relation to psychotic experiences in a community sample of children and adolescents
Introduction: Delimitation of fundamental concepts such as psychosis, schizophrenia, and psychotic experiences (PE) based on psychiatric clinical practice. PEs are subclinical manifestations of psychosis and exhibit positive and negative subdomains. They can be measured by psychometric scales such as the Community Assessment of Psychic Experiences (CAPE). The psychosis spectrum model, in which we have PEs on one side, moving through clinical high-risk for psychosis (CHR-P) and reaching psychotic disorders, provides the etiological basis for our investigation. The neurodevelopmental theory, the third version of the dopaminergic theory, and the disconnectivity theory define the theoretical basis of the study. We present the principles of functional magnetic resonance imaging (fMRI), BOLD signal acquisition, and hemodynamic response function. From this, we formulate functional connectivity, brain partitioning, and cortical networks. Some studies investigated psychotic experiences and others evaluated corticostriatal connectivity, but none performed these analyses together. Objectives: The main objective is to verify the relationship between corticostriatal connectivity and the total CAPE score for positive symptoms (CAPE-pos) in subjects aged 6 to 13 years. The secondary objective is to analyze this relationship with the subdomains of CAPE-pos (CAPE-PA, CAPE-PI, and CAPE-BE). Methodology: The sample is part of the Brazilian High-Risk Cohort Study (BHRCS), in which the CAPE scale was applied and fMRI and T1 images were acquired. The image was pre-processed following study standards, and processing proceeded at the first level to delimit corticostriatal connectivity and at the second level to delimit the relationship of this connectivity with the CAPE score. Results: A significant negative relationship was found between CAPE-pos and the connectivity of the dorsorostral putamen with regions of the prefrontal and cingulate cortex. 
For the CAPE-PA subdomain, a negative relationship was observed between the connectivity of the dorsorostral putamen and parts of the prefrontal cortex. Discussion: The pattern of brain connectivity changes with the severity of psychotic symptoms, with hypoconnectivity in the default mode network in PEs. In addition, the literature also points to consistent hypoconnectivity in associative areas up to the initial stages of psychotic disorders. Study limitations include the exclusion of participants due to movement in the MRI scanner and the weak statistical relationship found.
- Dias Guilardi, Mariana (Universidade de São Paulo, Brazil): Metatranscriptomics and machine learning for the prediction of zoonotic viruses hosted by bats from Northeastern Brazil
Bats are a critical global reservoir of zoonotic viruses, including high-consequence pathogens such as severe acute respiratory syndrome (SARS)- and Middle East respiratory syndrome (MERS)-related coronaviruses (CoVs), henipaviruses, and filoviruses. Bats’ viromes show a variety of viruses of interest to public health, but there are also unknown pathogens that could spread between animals and humans. Techniques used to identify new viruses, such as metatranscriptomics alongside bioinformatics and machine learning, help to detect and discover new pathogens that infect animals and that could be transmitted to humans. In this project, we aim to predict the zoonotic potential of viruses hosted by bats from northeastern Brazil. We sampled eleven pregnant females of three bat species: Desmodus rotundus, Eumops perotis, and Molossus molossus. For each individual, we collected samples from the central nervous system (CNS), gastrointestinal tract (GIT), kidney, and fetus. All the animals are from locations within the state of Ceará, Brazil. First, we performed a viral screening focusing on viruses of interest to human health. Then, we submitted 14 multi-tissue pools to RNA-sequencing with a depth of 60 million paired-end reads (150 bp) on an Illumina NovaSeq 6000 platform. We submitted the acquired reads to Genome Detective, software that trims adapters and assembles contigs and genomes using de novo and reference sequence assembly methods. The taxonomic identity of viral sequences will be confirmed using BLAST. For viruses with full genomes that are not recognized as having zoonotic potential, we will submit their genomes to the Zoonotic Rank algorithm, which uses machine learning models, such as a gradient boosting classifier, together with features from viral and human genome sequences, to predict the probability that an infected animal will transmit a pathogen to humans.
We will also submit viruses with unknown zoonotic potential and shorter sequences to the BERT-infect model. This software uses large language models (LLMs) that employ features from viral and human genome sequences to create an infectivity prediction model. With these approaches, we expect to identify viruses with zoonotic potential and provide new viral targets for in vivo approaches. This will promote measures for public health strategies aimed at preventing future epidemic and pandemic episodes.
Wednesday – Nov 19
- Exposito De Queiroz, Alfredo A. A. (Instituto Tecnológico de Aeronáutica (ITA), Brazil): Understanding Transferability in Noisy Image Classification Tasks
Transfer Learning is a machine learning strategy that enables the reuse of models trained on large source datasets for solving tasks in data-scarce target domains. It employs methods such as feature extraction and fine-tuning to adapt pre-trained models to new tasks efficiently. A key concept in this process is transferability, defined as a model’s ability to generalize well to a new target task. In this study, we explore transferability in the context of image classification using the Hymenoptera dataset (ants and bees), analyzing how models behave under varying levels of asymmetric label noise. Specifically, we apply uniform noise, in which class labels are randomly misassigned with equal probability across all possible labels. This simulates a common form of label corruption found in real-world datasets. We applied two fine-tuning strategies, standard fine-tuning and Batch Spectral Shrinkage (BSS), a regularization method that adjusts the singular values of the feature matrix to reduce overfitting. To evaluate the difficulty of the classification task and the potential for successful transfer, we used data complexity and instance hardness measures, besides standard transferability measures from the literature: Log Marginal Evidence (LogME), which estimates transferability without fine-tuning via Bayesian linear regression, and k-Disagreeing Neighbors (kDN), which quantifies local class overlap around an instance. The evaluation was conducted across different CNN architectures, using NDCG (Normalized Discounted Cumulative Gain) with k=3 and accuracy as the relevance score to rank model performance. Our results showed that BSS consistently improved LogME after fine-tuning across both noise conditions. Under 0% uniform noise, BSS improved LogME from 0.5635 to 0.9669 and kDN from 0.5614 to 0.5731, while standard fine-tuning improved LogME to 0.7407 and kDN to 0.7524. 
Under 20% uniform noise, the results were stable: BSS again improved LogME from 0.5635 to 0.9669 and kDN from 0.5614 to 0.5731, while standard fine-tuning showed consistent results (LogME: 0.7407, kDN: 0.7524). These findings suggest that BSS enhances transferability even in the presence of uniformly distributed label noise, especially in terms of LogME. However, kDN values remained lower with BSS under noise, possibly indicating a trade-off between generalization and local class separability.
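For readers unfamiliar with the kDN measure referenced above, it can be computed in a few lines. The sketch below is an illustrative implementation (Euclidean distances, arbitrary k), not the authors' code:

```python
import numpy as np

def kdn(X, y, k=5):
    """k-Disagreeing Neighbors: for each instance, the fraction of its k
    nearest neighbors (Euclidean distance) whose label differs from its own.
    Higher values indicate stronger local class overlap."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # pairwise squared Euclidean distances
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)        # exclude each point from its own neighborhood
    nn = np.argsort(d2, axis=1)[:, :k]  # indices of the k nearest neighbors
    return (y[nn] != y[:, None]).mean(axis=1)

# Two well-separated clusters: kDN should be 0 everywhere
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])
print(kdn(X, y, k=2))  # → [0. 0. 0. 0. 0. 0.]
```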
- Fernandes, Beatriz Silva (Universidade Federal de São Paulo, Brazil): Application of remote sensing techniques for water quality analysis in the Santos Estuary channels (SP, Brazil)
The Baixada Santista region in São Paulo, Brazil, features diverse and vital coastal ecosystems including restingas, rocky shores, and extensive mangroves, which support rich biodiversity and essential ecosystem services. However, intense human activities such as urbanization, tourism, port operations, and industry have led to environmental degradation, particularly impacting water quality. This project analyzes spatial and temporal changes in water quality indicators—chlorophyll-a and suspended solids—in the Santos Estuary channels (Santos, Bertioga, Piaçaguera, and São Vicente) from 2018 to 2024, with a focus on population shifts during the COVID-19 pandemic and environmental factors like tides, rainfall, and hydroelectric discharge. Using atmospherically corrected Sentinel-2 and Landsat satellite imagery validated against field data, we assessed and mapped water quality trends to better understand ecosystem responses to anthropogenic and natural influences. The study aims to fill critical monitoring gaps and enhance remote sensing methods integrated with in situ data for improved management of Brazilian coastal and marine environments.
- Gelabert, Julián Franco (Balseiro Institute, Argentina): Enhancing Non-Time-of-Flight PET Image Quality Using Deep Learning: A CNN Approach for Scatter and Noise Reduction
Background: Positron Emission Tomography (PET) is a crucial functional imaging technique in oncology. However, clinics utilizing older-generation scanners without Time-of-Flight (ToF) capabilities or Resolution Modeling (PSF) often contend with lower image quality, characterized by high noise levels and scatter contamination. This can compromise diagnostic confidence and quantitative accuracy. Objective: This work aims to develop and validate a deep learning-based post-processing framework to enhance the image quality of non-ToF, non-PSF corrected PET studies, making them comparable to those from more advanced, expensive scanners. Methods: A dataset of 18F-FDG PET studies was used. A convolutional neural network (CNN) architecture was designed and implemented in TensorFlow/Keras. The model was trained to learn a direct mapping from low-quality (non-ToF, non-PSF) PET images to their corresponding high-quality targets. The network’s performance was quantitatively evaluated using standard metrics such as Signal-to-Noise Ratio (SNR) and Contrast-to-Noise Ratio (CNR), together with qualitative assessment by medical physicists. Results: Preliminary results demonstrate that the proposed CNN model significantly improves image quality. Processed images show a marked reduction in noise and improved lesion delineation compared to the original reconstructions. Quantitative analysis confirms a statistically significant increase in SNR and CNR, enhancing the images’ diagnostic utility. Conclusion: Deep learning presents a powerful, software-based solution to mitigate hardware limitations in PET imaging. This research successfully demonstrates a proof-of-concept CNN model that enhances image quality from legacy scanners, potentially improving diagnostic accuracy and extending the functional lifespan of existing medical infrastructure. This approach underscores the transformative potential of machine learning in democratizing access to high-quality medical diagnostics.
Keywords: Deep Learning, Convolutional Neural Networks, PET Imaging, Image Reconstruction, Medical Physics, 18F-FDG.
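As a companion to the evaluation metrics named above, here is a minimal sketch of SNR and CNR on synthetic regions of interest. Definitions of these metrics vary across the PET literature and the abstract does not specify the exact formulas used, so the simple mean/standard-deviation forms below are an assumption:

```python
import numpy as np

def snr(signal_roi):
    """Signal-to-Noise Ratio of a region of interest: mean over std."""
    return signal_roi.mean() / signal_roi.std()

def cnr(lesion_roi, background_roi):
    """Contrast-to-Noise Ratio: mean difference between lesion and
    background, normalized by the background noise."""
    return (lesion_roi.mean() - background_roi.mean()) / background_roi.std()

rng = np.random.default_rng(0)
background = rng.normal(10.0, 2.0, size=1000)  # noisy background uptake
lesion = rng.normal(20.0, 2.0, size=1000)      # hotter lesion uptake
print(f"SNR = {snr(background):.2f}")
print(f"CNR = {cnr(lesion, background):.2f}")
```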
- Gonçalves, Eduardo Sell (Instituto de Física “Gleb Wataghin”, UNICAMP, Brazil): Investigating Degenerate Optical Parametric Oscillation Stability within SiN Microrings
Photonic processors have been explored to address the increased demand for efficient solutions to non-deterministic polynomial-time problems with exponential time and energy scaling. Recent studies have utilized a network of coupled degenerate optical parametric oscillators (DOPOs) based on the $\chi^{(2)}$ nonlinearity to create a time-multiplexed coherent Ising machine. These studies take advantage of the phase transition at the parametric oscillation threshold that results in a binary phase state offset by $\pi$, mimicking the binary spin system. In silicon nitride (SiN) microresonators, third-order nonlinearities have been explored for dual-pumped DOPO in various configurations, including normal group velocity dispersion (GVD) and in a system of coupled microcavities, for true random number generation. This research examines the temporal stability of degenerate signal/idler pair oscillation in SiN microrings with anomalous GVD. Our calculations reveal that higher-order cavity modes are involved: the geometry of the bus waveguide implies a better effective index coupling between its fundamental mode and the first two higher-order modes of the cavity. The experimental setup features dual-pump tones from tunable continuous-wave lasers, pulsed by synchronized electro-optic modulators (EOMs). Pumps are tuned to TE modes from the same family separated by $2\times$FSRs (relative mode numbers $= \pm1$). Independent detuning of each pump was carried out systematically. For each detuning step of the red pump, the entire spectral region near the resonance was scanned by the blue pump, considering thermal and nonlinear phase-shift effects like cross-phase modulation in the scanning region. The DOPO's median amplitude and interquartile range (IQR) were recorded at each step. Stability is characterized by high amplitude and low IQR. No DOPO is seen when the blue pump is far from resonance.
As it nears resonance, DOPO emerges, but significant amplitude fluctuations are observed. Further reducing the pump-cavity detuning intensifies these fluctuations. When the blue pump is finely tuned to resonance, a stable DOPO condition with minimal amplitude variation is achieved, persisting in a narrow detuning region. This is indicative of a controlled and sustained oscillatory state. As the blue pump undergoes additional detuning beyond this stable region, the photothermal pump dragging of the resonance reaches its maximum and terminates the DOPO. This behavior was examined across various red-pump laser detunings. The verified narrowness of the stability region suggests a specific set of detuning conditions for maintaining stability in the oscillatory state. Further investigation regarding the dependence of DOPO on the microring dispersion regime should allow broader stability conditions.
- Gonsalves, Paulo Henrique (State University of Ponta Grossa (UEPG), Brazil): Effects of random configuration in Turing pattern formation
One topic of nonlinear dynamics is the study of reaction–diffusion systems of activator-inhibitor compounds. This type of reaction and its analogues have applications in biology, ecology, physics, and geology, such as tumor growth, ecological invasions, and epidemiological studies. In particular, this type of system underlies morphogenesis, first described by Alan Turing in 1952, and is responsible for the spatial patterns that bear his name (Turing patterns), which frequently occur in animal fur and some plant leaves. In this study, we are analyzing the effects of different spatial distributions of activator-inhibitor compounds, as well as their magnitude and the randomness with which these regions are initially distributed in a two-dimensional space. We intend to use the different patterns formed by these reactions to identify and classify the initial conditions, as well as the resulting behavior of the system.
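A minimal two-dimensional activator-inhibitor simulation illustrates the kind of system described. The Gray–Scott model with randomly placed initial patches is used here as a stand-in, since the abstract does not name the specific reaction–diffusion model studied:

```python
import numpy as np

def laplacian(Z):
    # five-point stencil with periodic boundaries
    return (np.roll(Z, 1, 0) + np.roll(Z, -1, 0) +
            np.roll(Z, 1, 1) + np.roll(Z, -1, 1) - 4 * Z)

def gray_scott(n=64, steps=2000, Du=0.16, Dv=0.08, F=0.035, k=0.065, seed=0):
    """Gray-Scott activator-inhibitor dynamics with randomly placed
    initial perturbations; parameter values are a common 'spots' regime."""
    rng = np.random.default_rng(seed)
    U = np.ones((n, n))
    V = np.zeros((n, n))
    for _ in range(5):                     # seed a few random square patches
        i, j = rng.integers(0, n - 8, size=2)
        U[i:i + 8, j:j + 8] = 0.50
        V[i:i + 8, j:j + 8] = 0.25
    V += 0.01 * rng.random((n, n))         # small random noise
    for _ in range(steps):
        uvv = U * V * V
        U += Du * laplacian(U) - uvv + F * (1 - U)
        V += Dv * laplacian(V) + uvv - (F + k) * V
    return U, V

U, V = gray_scott()
print(U.min(), U.max())  # U varies across the domain: a spatial pattern formed
```

Changing the number, size, and placement of the initial patches (the `seed` and the patch loop) is exactly the kind of random-configuration experiment the abstract describes.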
- Guarimata, Juan Diego (Universidad Nacional de La Plata, Argentina): AI-Driven Generative Pipeline for Drug Discovery in Neglected and Emerging Infectious Diseases
Neglected tropical diseases such as Chagas disease (Trypanosoma cruzi), leishmaniasis (Leishmania spp.), and arboviruses like dengue remain major global health challenges with limited therapeutic options. Our work develops an integrated computational pipeline combining bioassay-driven data mining, machine learning, and generative deep learning approaches to accelerate drug discovery for these pathogens. Initially, bioactivity data from PubChem assays targeting essential enzymes (e.g., topoisomerases, cytochrome P450s, trypanothione reductase in trypanosomatids, and NS5 methyltransferase/polymerase in dengue), as well as phenotypic screening assays, were curated, prioritizing candidate scaffolds. Building on these scaffolds, we implemented a generative chemistry framework based on MolGAN, which produces novel molecular graphs guided by reward functions incorporating Quantitative Estimate of Drug-likeness (QED) and Synthetic Accessibility Score (SAS). The system integrates structural pocket information from crystallographic data, filtering rules for drug-likeness, and iterative evolutionary strategies. Candidate molecules undergo 3D conformer generation and molecular docking with OpenEye OMEGA/FRED, closing an active learning loop where high-scoring ligands are reincorporated into model training. Selected compounds predicted to show improved binding are subjected to molecular dynamics simulations to validate interaction stability. The top candidates will be prioritized for chemical synthesis and experimental evaluation, bridging computational predictions with translational drug discovery. This pipeline highlights the synergy of AI, structural biology, and cheminformatics in delivering tractable leads for neglected and emerging infectious diseases.
- Huancapaza Hilasaca, Liz (Instituto De Ciências Matemáticas e de Computação ICMC -USP, Brazil): CNN Architectures Adapted to Soundscapes
This work presents an analysis of a feature extraction method based on CNN embeddings applied to soundscape data classification through active learning. The proposal aims to highlight the limitations of existing representation-level approaches while exploring the ability of convolutional networks to generate more expressive and robust descriptors. The experiments carried out demonstrate the superiority of some of the explored methods over others, particularly in scenarios with limited labeled samples, showing consistent improvements in both performance and annotation efficiency. These results reinforce the potential of CNN embeddings as an effective strategy for classification tasks in complex acoustic environments, while also contributing to the advancement of active learning methodologies in contexts with scarce labeling resources.
- Jensen Didonet, Ricardo (Universidade Federal do Rio de Janeiro, Brazil): Accelerating CFD Simulations with Conditional Latent Diffusion Models
Computational Fluid Dynamics (CFD) is a cornerstone of modern science and engineering, enabling the simulation of complex physical phenomena. However, the high computational cost of these high-fidelity simulations presents a significant bottleneck in research and design cycles. While machine learning-based surrogate models offer a promising path to accelerate this process, many traditional architectures struggle to capture the full spatio-temporal complexity and inherent stochasticity of turbulent fluid flows. This research explores the application of state-of-the-art conditional Latent Diffusion Models (LDMs) as a powerful generative surrogate for time-variant CFD problems. Our approach involves a two-stage training process. First, a Variational Autoencoder (VAE) is trained on a large-scale CFD benchmark dataset to learn a low-dimensional, compressed latent representation of the high-dimensional velocity fields. This compression is critical for computational efficiency. Second, a conditional U-Net model is trained within this latent space. The model learns to reverse a diffusion (noising) process, conditioned on the previous time step’s velocity field, boundary conditions, and key physical parameters (e.g., viscosity, inlet velocity). By iteratively denoising a random latent vector, the model can autoregressively generate a sequence of future flow states that are both physically plausible and high-fidelity. This work demonstrates the successful adaptation of a powerful generative model from the computer vision domain to a challenging scientific simulation task, highlighting the potential of conditional LDMs to not only accelerate simulations but also to provide a new, controllable framework for generative scientific discovery.
- Levrino, Micaela (Instituto Balseiro, Argentina): Deep Learning-Based Image Quality Transfer for Low-Field Brain MRI
Magnetic resonance imaging (MRI) at low field strengths has gained renewed interest due to its lower cost, portability, and improved safety profile. However, its clinical adoption is limited by lower image quality compared to high-field MRI. In this work, we present preliminary results of our Master’s thesis in Medical Physics, focused on enhancing low-field brain MRI through deep learning-based image quality transfer (IQT). Our approach leverages paired datasets of high-field and simulated low-field images to train convolutional neural networks capable of improving resolution and signal-to-noise ratio in low-field acquisitions. We show initial experiments where 3T brain MRI volumes were resampled to lower resolutions to simulate 0.36T acquisitions, followed by image enhancement using a 3D U-Net architecture. The results demonstrate improved structural detail and tissue contrast in the reconstructed low-field images, highlighting the potential of IQT to bridge the quality gap and facilitate broader use of affordable MRI systems. These advances represent a promising step toward clinically viable low-field MRI applications, supporting both accessibility and quantitative neuroimaging in resource-limited settings.
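The low-field simulation step can be illustrated with a toy forward model. The block-averaging below is a crude stand-in for the resampling actually used in the thesis, which is not specified beyond "resampled to lower resolutions":

```python
import numpy as np

def block_downsample(vol, factor):
    """Simulate a lower-resolution acquisition by block-averaging an
    isotropic 3D volume (a crude stand-in for resampling 3T data to
    mimic a 0.36T scan)."""
    x, y, z = (s - s % factor for s in vol.shape)  # trim to a multiple of factor
    v = vol[:x, :y, :z]
    return v.reshape(x // factor, factor,
                     y // factor, factor,
                     z // factor, factor).mean(axis=(1, 3, 5))

rng = np.random.default_rng(0)
hi = rng.random((32, 32, 32))    # toy "high-field" volume
lo = block_downsample(hi, 4)     # simulated "low-field" volume
print(hi.shape, "->", lo.shape)  # (32, 32, 32) -> (8, 8, 8)
```

Pairs of `hi` and `lo` volumes produced this way are the kind of training data an image-quality-transfer network such as a 3D U-Net consumes.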
- Luz Oliveira, Vinícius (Universidade Federal de Santa Catarina, Brazil): Quantum Fourier Neural Operator in the context of fluid dynamics
Operator learning for surrogate maps of solution operators of differential equations is a fast-developing field within machine learning applications. A highly successful model for this purpose is the Neural Operator – a generalization of neural networks that maps between infinite-dimensional function spaces. With the advent of Quantum Machine Learning methods and NISQ-era optimization techniques, such as Variational Quantum Algorithms (VQAs), creating a quantum counterpart for such models is a promising direction. In this work, we explore the current literature on the topic by implementing, benchmarking, and commenting on models such as Quantum Fourier Layers and Quantum DeepONets. We also outline perspectives for the development of a new quantum model adapting the Physics-Informed Neural Operators – a model that implements physical laws in supervised training.
- Maia, Ariel Moura (INCT-TB – National Institute of Science and Technology in Tuberculosis/TECNOPUC – PUCRS’ Science and Technology Park, Brazil): A Simple Framework to Deconvolve Matrigel’s Spatial and Chemical Variables in a 3D Lung Cancer Model
Background: In drug development research, transitioning from simplistic 2D cell cultures to more physiologically relevant 3D models is critical for improving the predictive power of preclinical studies. However, these 3D systems often rely on biologically derived scaffolds like Matrigel, which introduces a significant data analysis challenge: it simultaneously provides both physical (spatial architecture) and chemical (bioactive components) signals. This confounding of variables makes it difficult to determine the true driver of observed cellular changes, effectively mixing a biological “signal” of interest with a complex source of “noise.” Objective: To develop and apply a workflow capable of deconvolving the distinct proteomic contributions of 3D physical architecture versus the confounding biochemical cues from the Matrigel substrate using high-resolution, label-free mass spectrometry data. Methods: We established and compared three distinct culture conditions using the Calu-3 lung cancer cell line: a standard 2D monolayer on plastic (baseline), 3D spheroids embedded within Matrigel (spatial + chemical effects), and a novel intermediate 2D culture grown in contact with a Matrigel substrate (the 2M group, isolating the chemical effect). A rigorous experimental design, sample preparation, and data processing pipeline was implemented to deconvolve Matrigel’s chemical influence, involving a multi-stage filtering strategy against a combined human-mouse proteome and a curated list of 312 known Matrigel components. Statistical and data analysis included unsupervised clustering, differential expression analysis (limma), and a dedicated paired analysis (t-test, Wilcoxon signed-rank, and linear regression) to quantify the subtle influence of the isolated chemical cues. Results: Our pipeline quantified 3,963 high-confidence human proteins. PCA revealed that culture dimensionality was the dominant driver of proteomic variance, clearly separating 3D clusters from 2D and 2M monolayers.
Differential analysis identified a core set of 72 differentially enriched and 48 conditionally quantified proteins remodeled by the 3D environment, primarily down-regulating cell-surface and cholesterol biosynthesis pathways while up-regulating iron-associated metabolic processes. While a direct 2D vs. 2M comparison yielded no significantly changed proteins, our paired statistical analysis revealed a key insight: the biochemical cues from Matrigel alone induced a subtle but highly significant proteome-wide shift, moving the cells closer to the 3D phenotype. Conclusion: This work establishes a robust experimental and computational framework for dissecting confounded variables in 3D proteomic datasets that use biological matrices such as Matrigel. By statistically isolating the influence of the scaffold’s biochemistry, we provide a more precise interpretation of cellular reprogramming in 3D models. This data-centric approach is critical not only for improving the design and interpretation of current precision medicine applications but also for providing the foundational data necessary for the rational, data-driven design of clean, synthetic biomaterials.
- Maia, Madalena C. (Observatório Nacional, Brazil): Determination of Atmospheric Parameters for M Dwarf Stars in the Praesepe Open Cluster: A Comparative Study Using the DESI and APOGEE Surveys
M dwarf stars constitute the most numerous stellar class in the Milky Way. However, due to the complexity of their spectra, which are rich in molecular lines, they remain the least studied spectral type in the literature. In this work, we perform a detailed spectroscopic analysis of 19 M dwarf stars belonging to the Praesepe open cluster, using spectra from the Galactic survey SDSS/APOGEE. The spectral modeling was conducted in local thermodynamic equilibrium, focusing on molecular lines of OH and water, through a previously validated methodology that derives atmospheric parameters and metallicities with high precision. We obtained a mean metallicity for Praesepe of <[M/H]> = 0.14 ± 0.07 dex, in agreement with high-resolution optical studies of G and K dwarf stars, as well as red giants, and consistent with the premise of chemical homogeneity in open clusters. We determined that the Praesepe open cluster has <[O/M]> = -0.02 ± 0.03 dex, a result consistent with expectations for an open cluster belonging to the Galactic thin disk. When comparing our results with the parameters derived by the APOGEE ASPCAP pipeline, we find that the uncalibrated values of log g are physically implausible, and that ASPCAP systematically underestimates [M/H] relative to our results. Furthermore, when comparing the stellar parameters obtained in this work with those derived by the SP and RVS pipelines from the DESI Milky Way Survey, we note that, especially for SP, the estimates deviate significantly from values predicted by theoretical models. We identify systematic differences between the RVS results and ours for all analyzed parameters. The SP pipeline derives systematically higher values of [M/H] and log g compared to our determinations; however, its Teff values are consistent with those we determined.
We also find that the RVS pipeline retrieves [M/H] values that are independent of the signal-to-noise ratio (SNR) of the DESI spectra in our sample, whereas the SP pipeline shows a trend with SNR, yielding more consistent [M/H] values and lower dispersions for spectra with higher SNR. This work not only provides a precise reference for M dwarfs, but also demonstrates that there are significant issues in the stellar parameter results for M dwarfs in the publicly available APOGEE and DESI data releases. These findings highlight important limitations in current parameter derivation pipelines for this stellar class. Our results may serve as a benchmark for future improvements in the DESI and APOGEE pipelines. Keywords: M dwarf stars, stellar atmospheric parameters, metallicity, open clusters, infrared spectrum, pipelines.
- Martins, Matheus Monteiro Ramalho Poltronieri (Centro Brasileiro de Pesquisas Físicas (CBPF), Brazil): Anisotropic interactions in vortex lattices of active matter
This work investigates the transition between ferromagnetic and antiferromagnetic ordering in vortex lattices of active matter. Using Lattice Boltzmann fluid dynamics simulations, we explored how an external horizontal force affects the collective behavior of these vortices. In the absence of a force, the system transitions from ferromagnetic to antiferromagnetic ordering as the gap width between cavities increases. The application of a horizontal force induces anisotropy, causing spins to exhibit ferromagnetic alignment perpendicular to the force and antiferromagnetic alignment parallel to it. Furthermore, a sufficiently strong force can suppress antiferromagnetic correlations, resulting in a preference for a purely ferromagnetic state. Future work will aim to derive an Ising-like model to better understand the emergence of these force-induced phases at the mesoscopic scale.
- Miranda, Gabriel Resende (UFBA – Universidade Federal da Bahia, Brazil): Bimodality in J-PAS: Optimizing catalogs for star-galaxy classification via Machine Learning
This work aimed to improve the quality of catalogs used in machine learning models for star-galaxy classification within the context of astronomical data provided by J-PAS (Javalambre Physics of the Accelerating Universe Astrophysical Survey). As part of the Data Validation (DAVA) team, our main contribution was cleaning the training and testing catalogs used by team members to develop their classification models, focusing on removing objects identified as bimodal. These objects are formed by nearby sources that SExtractor (a widely used tool for detecting objects in astronomical images) erroneously identifies as a single object. When this happens, particularly in cases where stars are mistakenly identified as galaxies, the performance of classification models can be negatively affected. Our methodology consisted of analyzing individual cutouts for each object in each catalog, applying a pipeline that detects bimodality based on count profiles along the x and y axes and the diagonals of each image. Multiple peaks detected in any of the profiles indicated the possibility of more than one object present in the cutout. Subsequently, we checked whether the J-PAS database itself already recognized multiple objects in the region of interest; only sources not previously separated were flagged as bimodal. The results showed significant effectiveness in identifying bimodal stars with mag_i < 21, but limitations for galaxies and fainter objects. After this analysis, we decided to remove 904 stars from the training catalog (648 wrongly classified as galaxies) and 244 from the testing catalog (185 false galaxies), which resulted in an increase of ~1% in completeness for both cases, without a negative impact on catalog purity. We concluded that removing bimodal stars improves the data quality for training more robust models, although further enhancements are needed, especially in handling galactic objects with more irregular morphologies.
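The profile-based bimodality check described above can be sketched as follows. This is an illustrative reconstruction of the idea (the peak threshold and profile definitions are assumptions), not the DAVA team's pipeline:

```python
import numpy as np

def count_peaks(profile, min_height=0.3):
    """Count local maxima above a fraction of the profile's peak value."""
    p = profile / profile.max()
    interior = p[1:-1]
    is_peak = (interior > p[:-2]) & (interior > p[2:]) & (interior >= min_height)
    return int(is_peak.sum())

def looks_bimodal(cutout):
    """Flag a cutout as possibly bimodal if any of its x, y, or diagonal
    count profiles shows more than one prominent peak."""
    diag = np.array([np.trace(cutout, k)
                     for k in range(-cutout.shape[0] + 1, cutout.shape[1])])
    profiles = [cutout.sum(axis=0), cutout.sum(axis=1), diag]
    return any(count_peaks(p) > 1 for p in profiles)

# toy cutouts: one Gaussian blob vs. two separated blobs
yy, xx = np.mgrid[0:32, 0:32]
blob = lambda cx, cy: np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / 8.0)
print(looks_bimodal(blob(16, 16)))                # single source
print(looks_bimodal(blob(9, 16) + blob(23, 16)))  # blended pair
```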
Thursday – Nov 20
- Muka, Paul (University of Sao Paulo, Brazil): Deploying Deep Learning for Pneumonia Classification
Pneumonia remains a major global health challenge, with high rates of morbidity and mortality, underscoring the need for accurate and timely diagnosis. This study explores the application of deep learning methods for automated classification of chest X-ray images into pneumonia and non-pneumonia cases. The Radiological Society of North America (RSNA) Pneumonia Detection Challenge dataset was employed, comprising 26,684 images, of which 6,012 showed pneumonia. Images were resized from 1024×1024 to 224×224 pixels, and augmentation techniques such as rotation, translation, and scaling were applied to enhance model robustness. A modified ResNet-18 architecture, adapted for single-channel input and binary classification, was implemented using PyTorch Lightning. The model was trained for 35 epochs with the Adam optimizer and binary cross-entropy loss. The results demonstrated strong classification performance, supported by confusion matrix analysis and class activation maps for interpretability. These findings highlight the potential of deep learning to support radiologists by improving diagnostic efficiency and reliability in pneumonia detection.
- Olegario, Marcos Vinicius Tomás (University of Sao Paulo – Sao Carlos Institute of Physics, Brazil): Physics-Informed Neural Networks for Event-by-Event Reconstruction of the First Interaction Depth in Extensive Air Showers
The depth of the first interaction (X_first) of extensive air showers (EAS) encodes information about the nature of primary cosmic rays, energetic particles from space that carry signatures of their astrophysical sources. However, X_first is not directly measurable with current detectors, and traditional analytical estimates depend heavily on high-energy hadronic interaction models, which involve significant uncertainties. In this work, we propose a Physics-Informed Neural Network to reconstruct X_first from the longitudinal energy deposit profile of EAS. The model is trained on a dataset of $5 \times 10^6$ Monte Carlo–simulated showers induced by five primary species (proton, helium, carbon, silicon, and iron). To incorporate physics constraints, an autoencoder submodel is used to extract a latent representation together with parameters of the Gaisser–Hillas function that describes the shower development. At the output stage, a custom loss function combines the mean squared error with a physics-informed regularization term accounting for the expected exponential distribution of X_first, a consequence of the mean free path of the primary particle. For proton-initiated showers, the model achieves a Pearson correlation coefficient of 0.960 between predicted and simulated X_first values; for iron-induced showers, the correlation was 0.734. In terms of resolution, the model attains better than 9 g/cm^2 for proton primaries and better than 6 g/cm^2 for iron. These results suggest the feasibility of event-by-event X_first reconstruction using deep learning techniques and provide meaningful insights into the nature of the primary cosmic ray.
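A combined loss of the kind described might look like the following sketch, where the physics term is the negative log-likelihood of the predictions under an exponential X_first distribution. Both the form of the penalty and its weight are assumptions for illustration, not the authors' exact loss:

```python
import numpy as np

def physics_informed_loss(pred, true, lam, beta=0.1):
    """MSE plus an exponential negative log-likelihood penalty that pushes
    the distribution of predicted X_first values toward Exp(mean = lam),
    the behavior expected from the primary's mean free path."""
    mse = np.mean((pred - true) ** 2)
    # -log p(x) for p(x) = exp(-x/lam)/lam
    nll = np.mean(pred / lam + np.log(lam))
    return mse + beta * nll

rng = np.random.default_rng(0)
x_true = rng.exponential(50.0, size=1000)        # X_first in g/cm^2, mean free path 50
x_pred = x_true + rng.normal(0.0, 5.0, 1000)     # imperfect reconstruction
print(physics_informed_loss(x_pred, x_true, lam=50.0))
```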
- Palma, Christian (Universidad de los Andes, Ecuador): Hybrid Analysis of Bioacoustic Classification and Ecoacoustic Metrics to Evaluate Biodiversity in Jocotoco Reserves
In this study, specialized classifiers based on neural networks were developed using transfer learning from embeddings generated by Perch 2 and BirdNET. Models were trained on acoustic recordings from the Jocotoco Foundation’s protected reserves, achieving improved detection accuracy compared to the default classification performance of Perch 2 and BirdNET. The classifiers cover three main domains: birds, anurans, and anthropophony, enabling ecological monitoring, detection of environmental threats, and biodiversity assessment. Additionally, classifier outputs were integrated with classical ecoacoustic indices (such as ADI, ACI, and acoustic entropy) to construct a Hybrid Acoustic Biodiversity Index, allowing quantitative characterization and comparison of soundscapes. This index provides a framework to evaluate the relative ecological quality of different habitats, offering robust indicators of community richness and complexity. The proposed hybrid approach demonstrates the feasibility of combining automated species classification with ecoacoustic metrics for conservation and biodiversity monitoring in tropical ecosystems.
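One of the classical descriptors mentioned, acoustic (spectral) entropy, can be sketched as follows. Library implementations of ADI, ACI, and entropy differ in windowing and binning details, so this is only an illustration of the idea:

```python
import numpy as np

def spectral_entropy(signal):
    """Normalized Shannon entropy of the signal's power spectrum: near 0
    for a pure tone, near 1 for broadband (noisy) soundscapes."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    p = spectrum / spectrum.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(len(spectrum)))

fs = 8000
t = np.linspace(0, 1, fs, endpoint=False)
tone = np.sin(2 * np.pi * 440 * t)      # single bird-like tone
rng = np.random.default_rng(0)
noise = rng.normal(size=fs)             # broadband soundscape
print(spectral_entropy(tone), spectral_entropy(noise))
```

Combining indices like this one with per-species classifier detections, as the abstract proposes, gives both a community-level and a species-level view of the same recording.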
- Prado Camargo, Arthur (IFUSP, Brazil): Mapping Teaching Internship Profiles Using Clustering and NLP in Public Distance Education
The professional training of future teachers in Brazil faces structural challenges, particularly during the supervised internship stage, which is essential for building pedagogical identity. This issue is intensified in distance education (EaD) models, such as UNIVESP, where mediation is remote but internships take place in person. This study applies data science techniques to analyze 100 teaching internship reports from UNIVESP’s mathematics licensure program. Each report was evaluated by researchers using 14 qualitative indicators, later converted into ordinal scales. We applied clustering algorithms (K-Means, DBSCAN), dimensionality reduction (PCA), and supervised learning (Random Forest, SVM) to explore patterns in the internship experiences. Additionally, we used NLP tools such as topic modeling (LDA) to analyze free-text descriptions of the actions taken by students. Preliminary results reveal the emergence of three distinct internship profiles: reflective and participatory, superficial, and collaborative-innovative. These profiles correlate with school types, education levels, and evaluator tendencies. The results offer actionable insights into the challenges of practical training in EaD contexts and show how machine learning can contribute to improving teacher education programs.
- Quiteque, Daniel (INATEL – NATIONAL INSTITUTE OF TELECOMMUNICATIONS, Brazil): Optimizing Energy Efficiency in IoT Devices Using Machine Learning
Internet of Things (IoT) devices are increasingly integral to smart systems, but their energy consumption poses sustainability challenges. My research focuses on developing machine learning models to optimize energy efficiency in IoT devices, aiming to reduce power consumption while maintaining performance. Using Python-based frameworks like Scikit-learn and TensorFlow, I am exploring supervised learning techniques, such as regression and decision trees, to predict and adjust energy usage patterns in real-time IoT applications. Preliminary experiments on a simulated IoT dataset show promising results, with potential energy savings of up to 15%. This project seeks to contribute to sustainable technology by integrating AI-driven solutions into IoT ecosystems. At the 4th School on Data Science and Machine Learning, I aim to refine these models using advanced techniques like foundation models and generative AI, fostering discussions on AI’s role in sustainability.
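As a concrete illustration of the regression approach mentioned, the sketch below fits an ordinary-least-squares model to synthetic IoT energy data. The features and coefficients are invented for the example and do not come from the project's dataset:

```python
import numpy as np

# Hypothetical features for an IoT node: CPU duty cycle (0-1), radio-on
# time (s), and sampling rate (Hz); target is energy per hour (mWh).
rng = np.random.default_rng(0)
X = rng.random((200, 3)) * [1.0, 60.0, 10.0]
true_w = np.array([30.0, 0.5, 2.0])
y = X @ true_w + 5.0 + rng.normal(0.0, 1.0, 200)

# Ordinary least squares with a bias term
A = np.hstack([X, np.ones((200, 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ w
print("learned weights:", np.round(w, 2))
print("RMSE:", np.sqrt(np.mean((pred - y) ** 2)))
```

A model like this, once trained, can flag operating points (duty cycle, radio time, sampling rate) predicted to waste energy, which is the real-time adjustment the abstract targets.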
- Rodrigues Soares, Vinícius (USP, Brazil): Comparing SuStain and K-means for Alzheimer’s Conversion Prediction with Multimodal Biomarker Profiling of MCI
Authors: SOARES, V. R.¹; CHIARI-CORREIA, R. D.¹; SALMON, C. E. G.¹ Affiliation: ¹Department of Physics, Faculty of Philosophy, Sciences and Letters, University of Sao Paulo. Background: Mild cognitive impairment (MCI) represents a prodromal stage of Alzheimer’s disease (AD), characterized by cognitive decline that does not yet interfere with activities of daily living. Some studies suggest the presence of subtypes of AD and MCI based on patterns of brain atrophy or cognitive deficits, while more recent work has employed machine learning approaches to integrate multimodal data and delineate subgroups with distinct risk profiles for AD conversion. Building on these efforts, the present study compares MCI subtypes derived from two approaches: a widely used unsupervised clustering method (k-means) and Subtype and Stage Inference (SuStaIn), a machine learning framework specifically designed to infer both subtypes and disease progression stages from cross-sectional biomarker data. Objective: Compare two ML approaches to predict MCI-to-AD conversion using multimodal data. Methods: We analyzed 558 MCI and 215 CN subjects from ADNI with longitudinal neuropsychological (NPS), CSF (β-amyloid/pTAU), and MRI-T1w volume data (FreeSurfer-processed). For modeling, eight biomarker combinations were used (Hippocampus, ABETA, PTAU, WM_hypointensities, Ventricles, RAVLT_forgetting, TRABSCOR, GDTOTAL). We clustered MCI subjects with k-means and SuStaIn; the number of subtypes was optimized by the gap statistic and CVIC, respectively, over 10 cross-validation folds. Finally, conversion rates at one and two years and clustering similarities were tracked. Results: Clustering analysis using k-means suggested six optimal subtypes of mild cognitive impairment (MCI), whereas the SuStaIn model identified five as optimal.
Despite this difference, both methods showed a high degree of overlap across corresponding subgroups. Specifically, k-means Cluster 0 corresponded to SuStaIn Subtype 2 (66.77%), Cluster 1 to Subtype 1 (83.05%), Cluster 2 to Subtype 2 (80.00%), Cluster 3 to Subtype 3 (86.27%), Cluster 4 to Subtype 4 (67.27%), and Cluster 5 to Subtype 0 (65.20%). Interestingly, k-means Cluster 4 grouped a higher proportion of Alzheimer’s disease (AD) subjects (71.8%) compared with SuStaIn Subtype 4. Conversely, k-means Cluster 5 exhibited fewer AD converters compared to SuStaIn Subtype 0. Furthermore, SuStaIn Subtype 2 appeared to be subdivided into two distinct groups by the k-means approach, suggesting finer resolution of heterogeneity. Conclusion: Although the k-means and SuStaIn methods indicate different numbers of MCI subtypes, they show high similarity in the distribution of individuals across the corresponding clusters/subtypes. However, further analyses are needed to delve deeper into these differences, similarities, and clinical contributions.
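As an illustration of the clustering step described in this abstract, below is a minimal, self-contained sketch of k-means (Lloyd's algorithm) on synthetic two-feature vectors. The data, feature count, and cluster count are invented for demonstration and are not ADNI biomarkers; the study's actual pipeline also involves the gap statistic, CVIC, and cross-validation, none of which are shown here:

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm: assign each point to its nearest centroid, then update centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct data points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # recompute centroids as cluster means; keep the old centroid if a cluster is empty
        centroids = [
            tuple(sum(x) / len(c) for x in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    labels = [min(range(k),
                  key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
              for p in points]
    return labels, centroids

# Two well-separated synthetic "biomarker profiles" (illustrative values only)
group_a = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0)]
group_b = [(5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
labels, cents = kmeans(group_a + group_b, k=2)
```

In practice one would standardize each biomarker before clustering, since k-means is sensitive to feature scale.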
- Rodrigues, Ailton José (Informatics Center (CIn) of the Federal University of Pernambuco (UFPE) and the Federal Institute of Piauí (IFPI), Brazil): BrasaGPT: A customized LLM for Forest Fire Analysis in Brazil.
The recent advancement of large language models (LLMs) represents a transformational capability at the frontier of artificial intelligence. However, LLMs are generalized models trained on large corpora of text and often struggle to provide context-specific information, especially in areas that require specialized knowledge, such as details about wildfires in Brazil within the broader context of climate change. For decision-makers focused on resilience and adaptation to wildfires, obtaining answers that are both accurate and domain-specific is crucial. To this end, we will develop BrasaGPT, a prototype LLM agent designed to transform user queries into actionable insights about wildfire risks in Brazil. We will enrich BrasaGPT by providing additional context, such as climate projections and scientific literature, to ensure its information is current, relevant, and scientifically accurate. This will allow BrasaGPT to serve as an effective tool for providing detailed, user-specific insights into wildfire risks in Brazil, supporting a diverse set of end users, including, but not limited to, researchers, forestry engineers, and fire prevention and control agencies, such as the Brazilian National Center for the Prevention and Control of Forest Fires (PrevFogo), in making decisions that generate positive impact.
- Rodrigues, Amancio Henrique Damasceno (Unicamp, Brazil): Talking to a metformin patient package insert
Medication package inserts are mandatory technical–scientific documents prepared by manufacturers and distributed with medicines. Their primary purpose, in the version intended for patients, is to provide essential information to the user. Both national and international literature indicate that these documents often feature excessively technical language, long sentences, complex structure, and scientific jargon, which make them difficult for lay readers to understand. This lack of comprehension can represent a significant barrier to the safe and correct use of medicines. With recent advances in Artificial Intelligence, especially in the field of Natural Language Processing, new possibilities have emerged to address these challenges. The use of Large Language Models (LLMs) combined with vector-based information retrieval enables the development of systems capable of locating information in a package insert and leveraging the adaptive capacity of AI models to provide personalized answers to users’ questions. Objective: The aim of this work was to create a system that uses Retrieval-Augmented Generation (RAG) and a Large Language Model to perform question–answering on a metformin package insert. Method: A metformin patient leaflet was randomly selected from all inserts containing the substance. The leaflet underwent an indexing stage consisting of splitting the text into chunks and transforming them into embeddings using pre-trained language models. The vectors were organized in a FAISS (Facebook AI Similarity Search) database to enable fast semantic search. The RAG pipeline was implemented so that, upon receiving a user question, it performs a search in the vector library and retrieves relevant excerpts (context). A Large Language Model (gpt-4o-mini) then generates an answer based on the question and the retrieved context. The prompt given to the model included instructions to use plain language, present the information in bullet points, and adopt a welcoming tone. 
Thirty-five questions were submitted to the system and evaluated for response time, coverage (the proportion of words in the answer that were present in the retrieved context), number of “not found” responses, and number of direct citations from the leaflet. Results: The average response time was 11.91 seconds, indicating the feasibility of using such a system to provide real-time guidance. Of the 35 questions, the system failed to answer 8. Among these 8, four were related to the physical appearance of the metformin tablet—information that is actually described in the leaflet. The model was also unable to provide the manufacturer’s toll-free (0800) number. The mean coverage was 42.4%, and there were 33 direct citations from the leaflet. Conclusion: The results demonstrate that employing these techniques is a promising strategy to provide information and improve comprehension. Adjustments are still needed regarding indexing and semantic search to reduce failures in specific queries. Further studies should be conducted to assess the applicability and effectiveness of the system in real-world contexts.
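The retrieval step of the RAG pipeline described above can be sketched in miniature. This is a toy stand-in: the real system uses pre-trained embedding models and a FAISS index, whereas here a bag-of-words count vector and a brute-force cosine-similarity scan play those roles, and the leaflet text is invented:

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# "Chunks" of a hypothetical leaflet (illustrative text, not the real insert)
chunks = [
    "metformin should be taken with meals to reduce stomach upset",
    "common side effects include nausea and diarrhea",
    "store the tablets at room temperature away from moisture",
]
index = [embed(c) for c in chunks]  # in the real pipeline: a FAISS index of embeddings

def retrieve(question, k=1):
    """Return the k chunks most similar to the question (brute-force search)."""
    q = embed(question)
    ranked = sorted(range(len(chunks)), key=lambda i: cosine(q, index[i]), reverse=True)
    return [chunks[i] for i in ranked[:k]]

context = retrieve("what are the side effects of metformin?")
```

In the actual system, `embed` would call a pre-trained embedding model, the linear scan would be a FAISS nearest-neighbor query, and the retrieved context would be passed to the LLM along with the user's question.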
- Santos, Guilherme Da Silva (IFUSP, Brazil): First-Principles Approaches to Materials Design for Electrolyzers
In line with global warming, the search for new, less polluting forms of energy production has become increasingly important [1]. Hydrogen stands out as a viable fuel alternative and can be efficiently used in fuel cells. Devices such as electrolyzers are responsible for its production [2]. However, challenges remain, including the high cost and scarcity of metals like platinum, which, despite having favorable catalytic properties, raise concerns about long-term sustainability [3]. In this context, collaborators performed a screening for the hydrogen evolution reaction (HER) on bimetallic surfaces and applied Natural Language Processing (NLP) to identify interesting material candidates for electrolyzers. Density Functional Theory (DFT) [4] is a powerful tool for studying catalysis, and this project uses it to deepen the understanding of materials identified through screening and NLP. A DFT study is being conducted on bimetallic alloys to explore ways to enhance their efficiency by analyzing hydrogen adsorption energy and other electronic properties at their active sites. These bimetallic combinations will be investigated in different material forms such as surfaces, nanoparticles, and supported clusters. For the promising alloy Pt₃Ni, bulk optimization of Pt, Ni, and Pt₃Ni has been completed, along with convergence studies for energy and volume optimization, which were validated against literature data. Subsequently, magnetization and density of states (DOS) were calculated. Different Miller-index surfaces of Pt and Pt₃Ni have been modeled for nanoparticle construction, and convergence studies for vacuum and layer thicknesses are currently underway. The study aims to provide key insights for developing more efficient and sustainable hydrogen evolution catalysts, advancing the performance and feasibility of electrolyzers. Reference 1: IPCC (2021). Climate Change 2021: The Physical Science Basis. Contribution of Working Group I to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change, V. Masson-Delmotte et al. (eds.). Cambridge University Press. Reference 2: Wappler, M. (2022). Building the green hydrogen market – Current state and outlook on green hydrogen demand and electrolyzer manufacturing. International Journal of Hydrogen Energy. Reference 3: Baletto, F.; Miranda, C. R.; Rigo, V. A.; Rossi, K. (2020). Nanoalloys for energy applications. In: Nanoalloys: From Fundamentals to Emergent Applications.
- Silva, Rodrigo (UFSCar (Universidade), Artefact (Empresa), Brazil): Physics-Informed Neural Networks for Quantum State Prediction and Parameter Discovery
Physics-Informed Neural Networks (PINNs) offer a way to embed physical laws into machine learning models, providing an alternative to traditional solvers of quantum systems. This work applies PINNs to the Jaynes–Cummings Hamiltonian with two objectives: predicting quantum state dynamics and identifying physical parameters. Preliminary results show that PINNs can capture key dynamical features while preserving physical consistency, suggesting their potential as efficient tools for quantum modeling.
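A PINN for this setting typically minimizes a data-misfit term plus a physics residual enforcing the governing equation. As a hedged sketch (the abstract does not specify the loss; the weight λ, the collocation points τⱼ, and the rotating-wave form of the Hamiltonian are assumptions), the objective might read:

```latex
% Jaynes–Cummings Hamiltonian (rotating-wave approximation)
H = \hbar\omega_c\, a^\dagger a
  + \frac{\hbar\omega_a}{2}\,\sigma_z
  + \hbar g\,\bigl(a\,\sigma_+ + a^\dagger \sigma_-\bigr)

% Composite PINN loss: data misfit plus Schrödinger-equation residual
\mathcal{L}(\theta) =
  \frac{1}{N_d}\sum_{i=1}^{N_d} \bigl\|\psi_\theta(t_i) - \psi_i\bigr\|^2
  + \lambda\,\frac{1}{N_c}\sum_{j=1}^{N_c}
    \bigl\| i\hbar\,\partial_t \psi_\theta(\tau_j) - H\,\psi_\theta(\tau_j) \bigr\|^2
```

For parameter discovery, quantities such as the coupling g are promoted to trainable parameters and optimized jointly with the network weights θ.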
- Tormet Gonzalez, Sinkler Eduardo (UNICAMP, Brazil): Integrating Machine Learning and Experimental Validation for Drug–Excipient Compatibility Studies
Drug–excipient compatibility is a critical factor in pharmaceutical formulation, directly impacting drug stability, efficacy, and safety. Traditional compatibility studies rely on experimental methods such as Differential Scanning Calorimetry (DSC), Fourier Transform Infrared Spectroscopy (FTIR), and High-Performance Liquid Chromatography (HPLC). While effective, these methods are time-consuming, resource-intensive, and often limited in scalability. In this project, we propose a platform to accelerate compatibility studies by combining Natural Language Processing (NLP) with Machine Learning (ML). Relevant data is extracted from unstructured sources such as scientific publications, patents, and technical reports, then curated into a structured database. This database serves as the foundation for predictive ML models designed to assess drug–excipient interactions with high accuracy. A parallel web-based interface is being developed to make these predictive models accessible to researchers and industry professionals, enabling rapid, data-driven decision-making in formulation design. Experimental validation using FTIR, LC-MS/MS, and NMR will complement computational predictions, ensuring robustness and reproducibility. By integrating automation, predictive modeling, and experimental validation, this work aims to reduce costs and time associated with traditional testing, ultimately accelerating innovation in drug development and improving pharmaceutical quality control.
- Vianna, Vinicius Serra (Universidade Estadual de Campinas, Brazil): Contact Force Modeling through Symbolic Regression – Theory and Experimentation
This project investigates the dynamics of an impact between a rotor and a stator. A test rig, developed and refined over several previous studies, serves as the experimental platform for analyzing rotor/stator contact events. To improve the accuracy and completeness of the measurements, two additional accelerometers were installed in the setup, allowing acceleration data to be collected in all directions within the impact housing. The experimental campaign focused on capturing detailed dynamic responses during contact events under different operating conditions. The collected acceleration data was processed to estimate contact forces and characterize the system behavior during impacts. This rich dataset provided a reliable foundation for both data-driven and theoretical modeling approaches. Using a symbolic regression algorithm, a contact force model was proposed based on the experimental data. Symbolic regression, known for its ability to generate interpretable mathematical expressions, was employed to identify an analytical relationship that describes the contact force as a function of measurable system variables. The resulting model reflects the complex nonlinear behavior observed during rotor/stator impacts. A comparative analysis was conducted between the symbolic regression model, the experimental measurements, and fifteen theoretical constitutive equations commonly used in contact force modeling. The results demonstrate that the symbolic model achieves high agreement with the experimental data, while also capturing features not fully represented by the traditional models. This highlights the potential of symbolic regression as a powerful tool for modeling complex physical interactions in rotating machinery.
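Symbolic regression searches a space of candidate expressions for the best-fitting analytical form. As a much narrower illustration of recovering a contact law from data, the sketch below fits a Hertz-type relation F = k·δⁿ to noisy synthetic measurements via a log-log least-squares fit. The constants, noise level, and data are invented; this is neither the rig data nor the symbolic regression algorithm used in the study:

```python
import math
import random

# Synthetic "measurements" of a Hertz-type contact law F = k * delta**n
# (illustrative values; a real study would use measured penetration/force pairs)
k_true, n_true = 2.0e9, 1.5
rng = random.Random(1)
deltas = [1e-5 * (i + 1) for i in range(20)]  # penetration depths [m]
forces = [k_true * d ** n_true * (1 + 0.01 * rng.gauss(0, 1)) for d in deltas]

# Log-log linearization: log F = log k + n * log delta  ->  ordinary least squares
xs = [math.log(d) for d in deltas]
ys = [math.log(f) for f in forces]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
n_est = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
k_est = math.exp(my - n_est * mx)
```

Symbolic regression generalizes this idea: instead of assuming the power-law form up front, it searches over expression trees and reports the form itself, which is what makes it attractive when the contact behavior departs from the classical constitutive laws.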
- Vicente, João Gabriel (Universidade Estadual de Londrina, Brazil): Reconstructing the Local Expansion Rate with Anisotropic and Incomplete Sky Coverage.
We infer the first three multipoles of the local expansion rate using the Cosmicflows-4 catalog, a heterogeneous compendium of redshifts and distance moduli of galaxies in the nearby Universe. We focus on galaxies within the spherical shell of 30h^{-1}–150h^{-1} Mpc. To account for the anisotropic sky coverage and the systematic lack of data along the Galactic Plane, we apply a pixelization and masking procedure, where empty or inhomogeneously sampled pixels are masked. The full-sky map is then reconstructed via a maximum-likelihood estimation of the masked harmonic coefficients. We validate this methodology by performing several simulations of models with different levels of anisotropy, statistical noise, sample distributions, and sky coverage. In the context of the Lambda-CDM model, our preliminary results indicate a bulk flow of (174 \pm 13) km s^{-1} directed towards Galactic coordinates (l, b) = (293°, -2°) \pm (4°, 2°). Additionally, the quadrupole and the octupole are roughly ten times less intense than the dipole. We find that, while the quadrupole and octupole are consistent with Lambda-CDM predictions, such an intense dipole has only a 0.8% chance of occurring in a perturbed Lambda-CDM cosmology. These results are consistent with previous studies, providing an independent inference.
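As a simplified illustration of inferring a dipole (bulk flow) from radial data, the sketch below fits a velocity vector v from synthetic line-of-sight velocities u_i = v·n̂_i, with a crude mask excluding a band around the equator as a stand-in for the Galactic Plane. The numbers are invented, and this toy fit is far simpler than the masked-harmonic maximum-likelihood reconstruction described in the abstract:

```python
import math
import random

rng = random.Random(2)
v_true = (100.0, -50.0, 30.0)  # km/s, hypothetical bulk-flow vector

# Random sky directions (unit vectors), excluding a band |z| < 0.1
# as a crude stand-in for the masked Galactic Plane region
dirs = []
while len(dirs) < 500:
    z = rng.uniform(-1, 1)
    if abs(z) < 0.1:
        continue
    phi = rng.uniform(0, 2 * math.pi)
    s = math.sqrt(1 - z * z)
    dirs.append((s * math.cos(phi), s * math.sin(phi), z))

# Noisy line-of-sight velocities: u_i = v . n_i + noise
u = [sum(a * b for a, b in zip(v_true, n)) + rng.gauss(0, 20) for n in dirs]

# Least squares via 3x3 normal equations: A v = b
A = [[sum(n[j] * n[k] for n in dirs) for k in range(3)] for j in range(3)]
b = [sum(ui * n[j] for ui, n in zip(u, dirs)) for j in range(3)]

def solve3(A, b):
    """Gaussian elimination with partial pivoting for a 3x3 linear system."""
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(3):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][3] / M[i][i] for i in range(3)]

v_est = solve3(A, b)
```

The key point the toy shares with the real analysis is that a mask couples the multipoles, so the estimator must be built from the actually sampled directions rather than assuming full-sky orthogonality.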
- Waehner, Nicolas (University of Buenos Aires, Argentina): Machine learning for exoplanet detection using the radial velocity method
Only 30 years ago, the first exoplanet orbiting a Sun-like star was discovered. To date, more than 5900 have been detected, and the number continues to grow rapidly thanks to technological advances. The radial velocity (RV) method has proven to be one of the most successful and promising techniques for detecting planets through the motions they induce on their host stars. While instrumental improvements have enabled the measurement of increasingly smaller velocity variations, stellar activity and irregular sampling can make the detection of planetary signals more difficult and lead to false positives. For this reason, machine learning techniques have recently begun to be explored to address this challenge. In this thesis, we develop a convolutional neural network with an attention layer to detect planetary signals in Sun-like stars, using simulated RV measurements generated over observation calendars representative of exoplanet searches. The network achieves 54% fewer false positives than the traditional null-hypothesis-based approach, without increasing the number of false negatives. This improvement is mainly concentrated in low-amplitude signals, associated with low-mass planets. In addition, the attention layer weights were analyzed to identify which regions of the input the model prioritizes during classification, revealing a correlation between these weights and the network’s predictions. The method was further evaluated on 159 real signals from stars with at least one confirmed planet and achieved correct classifications in most cases. These data were also used to perform fine-tuning on the network, enhancing its detection capability on real observations. Overall, these results highlight the potential of neural networks as a promising tool for the detection of planetary signals in radial velocity measurements.
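The attention weights analyzed in this work can be illustrated in miniature. The sketch below implements plain scaled dot-product attention over a short toy sequence in pure Python; the numbers are invented and this is not the network from the thesis, only the mechanism whose weights indicate which input regions the model prioritizes:

```python
import math

def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query over a sequence (one head)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)  # these are the weights one inspects for interpretability
    out = [sum(w * v[i] for w, v in zip(weights, values))
           for i in range(len(values[0]))]
    return out, weights

# Toy sequence: position 2 carries a distinctive "signal-like" feature
keys = [[0.0, 0.0], [0.1, 0.0], [3.0, 0.0], [0.0, 0.1]]
values = [[1.0], [2.0], [9.0], [4.0]]
query = [1.0, 0.0]
out, weights = attention(query, keys, values)
```

Because the weights sum to one and concentrate on the positions most aligned with the query, plotting them against the input (as done in the thesis for RV time series) reveals which regions drive the classification.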
Program
Sunday (Nov 16): Toolbox Kickoff
Morning:
- 10:00-10:30: Registration
- 10:30-12:30: Pandas – Data Wrangling 101 – Raphael Cobe
Afternoon:
- 2:00-3:30: Matplotlib – Data Visualization – Raphael Cobe
- 3:30-4:00: Coffee break
- 4:00-5:30: PyTorch – Tensors & Computation – Raphael Cobe
Monday (Nov 17): Foundations of Neural Networks
Morning:
- 9:00-09:30: Welcome and introduction to the school – Raphael Cobe
- 9:30-10:30: Statistical Methods for Data Analysis I- Raphael Cobe
- 10:30-11:00: Coffee break
- 11:00-12:30: Statistical Methods for Data Analysis II – Raphael Cobe
Afternoon:
- 2:00-3:30: Neural Networks 101: Historical development and basic concepts – Renato Vicente
- 3:30-4:00: Coffee break
- 4:00-5:30: The Perceptron: Building blocks of neural networks (intuitive, visual approach) – Renato Vicente
- 5:30-6:30: Interactive session: Building a simple perceptron using visual tools (no coding) – Raphael Cobe
Tuesday (Nov 18): Going Deeper with Neural Networks
Morning:
- 9:00-10:30: From Single to Multi-Layer Networks – Renato Vicente
- 10:30-11:00: Coffee break
- 11:00-12:30: Backpropagation Explained – Renato Vicente
Afternoon:
- 2:00-3:30: Activation functions and network training – Renato Vicente
- 3:30-4:00: Coffee break
- 4:00-5:30: Convolutional Neural Networks (CNNs) I: Understanding images – Alexandre Simões
- 5:30-6:30: Poster Session
Wednesday (Nov 19): Specialized Architectures
Morning:
- 9:00-10:30: Convolutional Neural Networks (CNNs) II: Understanding images – Alexandre Simões
- 10:30-11:00: Coffee break
- 11:00-12:30: Recurrent Neural Networks (RNNs): Processing sequences and time series – Alexandre Simões
Afternoon:
- 2:00-3:30: Transformers: The architecture behind modern AI – Alexandre Simões
- 3:30-4:00: Coffee break
- 4:00-5:30: Large Language Models and generative AI: Concepts and implications – Alexandre Simões
- 5:30-6:30: Poster Session
Thursday (Nov 20): Applications and Domain-Specific Uses
Morning:
- 9:00-10:30: The Role of HPC in AI – TBC
- 10:30-11:00: Coffee break
- 11:00-12:30: AI for the Climate Emergency – Fabio Ortega
Afternoon:
- 2:00-3:30: Universidade: Caminhos para a Era da IA – Sergio Novaes
- 3:30-4:00: Coffee break
- 4:00-5:30: AI in life sciences – Carolina Gonzalez
- 5:30-6:30: Poster Session
Friday (Nov 21): Frontiers and Future Directions
Morning:
- 9:00-10:30: AI Benchmarks – Felippe Alves
- 10:30-11:00: Coffee break
- 11:00-12:30: Multi-agent LLM system – Davi Bastos Costa
Afternoon:
- 2:00-3:30: Explainable AI – Ana Carolina Lorena
- 3:30-4:00: Coffee break
- 4:00-5:30: “Ethics in AI” – Rafael Zanatta
- 5:30-6:00: Closing
The schedule is subject to change.
Venue
Venue: The event will be held at IFT-UNESP, located at R. Jornalista Aloysio Biondi, 120 – Barra Funda, São Paulo. The easiest way to reach us is by subway or bus. See arrival instructions here.
Accommodation: Participants whose accommodation will be provided by the institute will stay at Hotel Intercity the Universe Paulista. Hotel recommendations are available here.
Attention! Some participants in ICTP-SAIFR activities have received e-mails from fake travel agencies asking for credit card information. All communication with participants will be made by ICTP-SAIFR staff using an e-mail address ending in “@ictp-saifr.org”. We will not send any mailings about accommodation that require a credit card number or any sort of deposit. Also, if you are staying at Hotel Intercity the Universe Paulista, please confirm with the Uber/taxi driver that the hotel is located at Rua Pamplona 83 in Bela Vista (and not in Jardim Etelvina).
Additional Information
BOARDING PASS: All participants whose travel has been provided or will be reimbursed by ICTP-SAIFR should bring their boarding pass to registration. The return boarding pass (PDF if online check-in; scan or photo if physical) should be sent by e-mail to secretary@ictp-saifr.org.
Visa information: Nationals from several countries in Latin America and Europe are exempt from a tourist visa. Nationals from Australia, Canada, and the USA are required to apply for a tourist visa.
Poster presentation: Participants who are presenting a poster MUST BRING A PRINTED BANNER. The banner size should be at most 1 m (width) x 1.5 m (height). We do not accept A4 or A3 paper.
Power outlets: The standard power outlet in Brazil is type N (two round pins + grounding pin). Some European devices are compatible with the Brazilian power outlets. US devices will require an adapter.