ARXIV TEMPORAL GRAPH

Silas Liu - July 08, 2025

Graph Algorithms

Graphs are powerful tools for modeling the evolution of complex systems, and scientific research is no exception. In this project, I created a temporal graph of arXiv papers from 2017 to 2024, capturing how the AI and Machine Learning landscape has transformed over time. This dynamic structure allowed me to analyze the progression of ideas, methods and research trends across years. 

​​

Building on this foundation, I applied a diverse set of graph-based and data science techniques, including time-weighted PageRank, topic co-occurrence networks, joint structural-semantic embeddings and temporal clustering. This combination of approaches highlights the versatility of temporal graphs for extracting meaningful patterns and tracing how influence, themes and communities evolve within a rapidly growing scientific domain.

graph_01.png

​​In this project, I built a temporal graph representation of all papers published on arXiv between 2017 and 2024, focusing on the main categories related to Data Science. The graph serves as the foundation for various types of analysis using graph algorithms, temporal techniques, and NLP methods.

 

As shown in the plot, there has been a sharp increase in the number of papers in the categories cs.CL (Computation and Language) and cs.LG (Machine Learning) starting around 2018, with a notable acceleration from 2020 onward. This trend closely mirrors the community’s growing focus on large language models (LLMs), sparked by the widespread adoption of transformers and the breakthrough impact of models like BERT and GPT. The surge in cs.CL submissions reflects the explosion of research in NLP driven by LLMs, while cs.LG captures the broader methodological innovations powering them.

​The core of our structure revolves around the Paper node, representing each arXiv article. Each paper includes key attributes such as title, abstract and published date, and connects to other entities like Category, Concept and Topic, which enrich our ability to extract context and identify emerging research trends.

​

Categories follow the official arXiv classification. Topics are assigned by arXiv and organized hierarchically into Subfield, Field and Domain. Concepts are sourced from OpenAlex, using algorithmic tagging based on content similarity and metadata analysis.

​

A critical relationship in this graph is the citation network, captured by CITES edges that represent references between papers, complete with temporal metadata. These connections are central to our influence and impact analysis. While internal arXiv citations connect two Paper nodes, we also model references to and from external sources, such as journals, books and conference proceedings, via ExternalPaper nodes. The schema below shows our graph structure.

Captura de tela 2025-06-01 150040.png

In our graph we have a total of 1,910,542 nodes and 7,333,115 relationships. Below we have the detailed main entities.

Node             Count
Paper            173,622
ExternalPaper    1,718,313
Category         5
Concept          15,627
Topic            2,677
Subfield         233
Field            26
Domain           4

Relationship        Count
CITES               2,603,588
IN_CATEGORY         173,622
HAS_CONCEPT         2,372,302
HAS_TOPIC           424,213
PART_OF_SUBFIELD    2,677
PART_OF_FIELD       233
PART_OF_DOMAIN      26

Time-Weighted Ranking

​​​​​​​

My first analysis involves applying the PageRank algorithm to the citation graph to identify the most influential papers in Data Science. Originally developed by Google for ranking web pages, here PageRank treats papers as nodes and citations as edges. A paper's influence is determined not just by how often it is cited, but also by the influence of the papers that cite it, a recursive model that captures academic reputation in depth.

​

To reflect the dynamic nature of scientific relevance, I enhanced the standard PageRank with a time-decay function that gradually down-weights older citations. This approach ensures that recent citations carry more weight, aligning the influence score with current research frontiers. The result is a time-aware PageRank that highlights both foundational contributions and actively influential work. Since our graph links papers to hierarchical topics and concepts, this influence metric can be extended to multi-level structures, allowing me to trace impact not just at the paper level, but across entire research domains.
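The time-decay idea can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the project's implementation: the post does not say which decay function was used, so the exponential half-life form, the `half_life_years` parameter, the 0.85 damping factor and the toy citation list are all hypothetical. In this formulation each citing paper passes on only a decayed share of its rank, and the remainder is redistributed uniformly so the scores still sum to 1.

```python
from collections import defaultdict

def decay_weight(citing_year, ref_year, half_life_years=3.0):
    """Exponential decay: a citation made one half-life before ref_year counts half."""
    age = max(ref_year - citing_year, 0)
    return 0.5 ** (age / half_life_years)

def time_weighted_pagerank(citations, ref_year, damping=0.85,
                           half_life_years=3.0, iters=200, tol=1e-12):
    """citations: iterable of (citing, cited, citing_year). Returns {paper: score}."""
    refs = defaultdict(set)   # citing paper -> papers it cites
    year_of = {}              # citing paper -> year its citations were made
    nodes = set()
    for src, dst, year in citations:
        nodes.update((src, dst))
        refs[src].add(dst)
        year_of[src] = year   # assumption: one citation year per citing paper
    nodes = sorted(nodes)
    n = len(nodes)
    # Share of its rank that each citing paper still passes on at ref_year.
    keep = {s: decay_weight(year_of[s], ref_year, half_life_years) for s in refs}
    rank = dict.fromkeys(nodes, 1.0 / n)
    for _ in range(iters):
        # Rank not passed along citations (dangling papers, decayed-away share)
        # is spread uniformly, which keeps the total at 1.
        leaked = sum(rank[v] * (1.0 - keep.get(v, 0.0)) for v in nodes)
        base = (1.0 - damping) / n + damping * leaked / n
        new = dict.fromkeys(nodes, base)
        for src, targets in refs.items():
            share = damping * rank[src] * keep[src] / len(targets)
            for dst in targets:
                new[dst] += share
        delta = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if delta < tol:
            break
    return rank

# Toy data: "A" is cited by a 2024 paper, "B" by a 2017 paper.
toy_citations = [("new1", "A", 2024), ("old1", "B", 2017)]
toy_ranks = time_weighted_pagerank(toy_citations, ref_year=2024)
print(sorted(toy_ranks.items(), key=lambda kv: -kv[1]))
```

With the decay switched off (a very large `half_life_years`), this reduces to ordinary PageRank on the citation graph, which makes it easy to compare the two rankings side by side.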

Of the top 20 papers by PageRank in 2024, six focus on transformer-based language models: Attention Is All You Need, BERT, RoBERTa, XLNet, T5 and DistilBERT. Their dominance reflects how the late-2020 surge in LLM research reinforced the centrality of attention-based architectures. While more recent models like GPT-3 emerged later, their foundations are rooted in this 2017-2019 wave. The enduring presence of multiple BERT variants in 2024 underscores how transformer pretraining schemes remain deeply embedded in the field. Meanwhile, papers on infrastructure and tooling, such as PyTorch, UMAP, EfficientNet and PPO, round out the list, reflecting the dual engine of progress: transformative model architectures and robust enabling technologies.

To understand how influence shifts at the conceptual level, I aggregated the yearly top 20 papers by their concept tags and tracked their cumulative PageRank from 2017 to 2024. This lens reveals the changing prominence of ideas across time.
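A rough sketch of this aggregation step is shown below. The function name, the input layout and the toy scores are hypothetical, and it assumes the per-concept value for a year is simply the sum of PageRank scores of that year's top papers carrying the concept tag; the post does not spell out the exact aggregation.

```python
from collections import defaultdict

def concept_influence(yearly_scores, paper_concepts, top_n=20):
    """yearly_scores: {year: {paper: pagerank}}; paper_concepts: {paper: [concepts]}.

    Returns {concept: {year: summed score of that year's top_n papers with the tag}}.
    """
    out = defaultdict(dict)
    for year, scores in yearly_scores.items():
        top_papers = sorted(scores, key=scores.get, reverse=True)[:top_n]
        for paper in top_papers:
            for concept in paper_concepts.get(paper, ()):
                out[concept][year] = out[concept].get(year, 0.0) + scores[paper]
    return dict(out)

# Toy data (made-up scores), keeping only the single top paper per year.
toy_scores = {
    2017: {"rnn_paper": 0.5, "transformer_paper": 0.25},
    2024: {"rnn_paper": 0.125, "transformer_paper": 0.75},
}
toy_concepts = {
    "rnn_paper": ["Recurrent Neural Networks"],
    "transformer_paper": ["Transformer"],
}
influence = concept_influence(toy_scores, toy_concepts, top_n=1)
print(influence)
```

A concept that drops out of the yearly top list simply has no entry for that year, which is how fading concepts show up in this representation.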

​

Several concepts saw their early influence fade: Recurrent Neural Networks (RNNs), SemEval and Lossless Compression were notable in 2017-2018 but disappeared from recent rankings. RNNs, once the backbone of NLP, were overtaken by more scalable transformer models. Similarly, SemEval tasks and classical compression methods gave way to neural benchmarks and pretraining-driven pipelines.

​

By contrast, concepts like Transformer, BLEU, Representation (politics) and Code (set theory) rose steadily after 2021. The transformer's ascent reflects the LLM boom, and BLEU remains vital for evaluating generative models. The rising attention to Representation (politics) mirrors growing interest in fairness and bias in AI. Meanwhile, Code (set theory) connects to the growing emphasis on code generation and symbolic reasoning, increasingly relevant in models like Codex and AlphaCode.

​

Across the entire period, the most consistently influential concepts are Transformer, BLEU, MNIST database, Deep Neural Networks and Code (set theory). While Transformer, BLEU and Code (set theory) highlight the rise of LLMs and code-focused tasks, MNIST and DNNs represent the enduring relevance of foundational benchmarks and architectures that continue to shape deep learning.

​

This concept-level evolution complements the paper-level view, tracing how the field's center of gravity shifted from classical NLP and early computer vision to transformers, automated code synthesis and socially-aware AI systems.

​

Co-occurrence Topics Graph

To complement the influence analysis, I constructed a co-occurrence graph based on topic annotations from the top PageRank papers over the years. Each node represents a unique research topic, and an edge between two nodes indicates that those topics appeared together in at least one paper. Edge thickness corresponds to the frequency of co-occurrence, offering a lens into how research areas cluster and evolve.
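The construction described above can be sketched as follows. The function name and the toy paper-to-topic mapping are hypothetical; the topic labels are taken from this post. Each unordered topic pair is an edge, and its count plays the role of edge thickness.

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence(paper_topics):
    """paper_topics: {paper_id: iterable of topic names}.

    Returns a Counter mapping each unordered topic pair (stored as a sorted
    tuple) to the number of papers in which both topics appear.
    """
    edges = Counter()
    for topics in paper_topics.values():
        # set() avoids counting a pair twice if a topic is repeated on a paper.
        for a, b in combinations(sorted(set(topics)), 2):
            edges[(a, b)] += 1
    return edges

# Toy annotations for three papers.
paper_topics = {
    "p1": ["Topic Modeling", "Natural Language Processing Techniques"],
    "p2": ["Natural Language Processing Techniques", "Topic Modeling",
           "Speech Recognition and Synthesis"],
    "p3": ["Advanced Vision and Imaging", "3D Shape Modeling"],
}
edges = build_cooccurrence(paper_topics)
for pair, weight in edges.most_common():
    print(weight, pair)
```

Storing pairs as sorted tuples keeps the graph undirected without needing a graph library; the resulting weights can be passed to any plotting or graph package afterwards.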

​

Using a community detection algorithm, I identified six prominent topic clusters, each shown in a distinct color in the graph. One of the densest and most central revolves around Topic Modeling, Natural Language Processing Techniques, Speech Recognition and Synthesis, and Multimodal Machine Learning Applications. This grouping reflects a strong deep learning and NLP focus, likely aligned with the growing influence of LLMs and the dominance of text-based modeling in recent research.

​

There are also some smaller clusters linking topics such as Advanced Vision and Imaging, 3D Shape Modeling, and Robotics and Sensor-Based Localization, pointing to robust thematic niches around computer vision and robotics. These areas show concentrated co-occurrence, especially in applied domains such as autonomous systems and medical imaging.

​

To identify cross-cutting topics, I applied betweenness centrality and found that two topics, Topic Modeling and Generative Adversarial Networks and Image Synthesis, act as key bridges across clusters. They are followed by Natural Language Processing Techniques, Machine Learning and Data Classification, and Neural Networks and Applications, topics that not only dominate locally but also connect otherwise distinct subfields.
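For readers unfamiliar with the measure, betweenness centrality counts how many shortest paths between other node pairs pass through a node. The sketch below implements Brandes' algorithm for an unweighted, undirected graph in plain Python; it ignores edge weights, and the five-node toy graph is a made-up chain, not the project's topic graph.

```python
from collections import deque

def betweenness_centrality(adj):
    """Brandes' algorithm. adj: {node: set of neighbours} (undirected, unweighted).

    Returns unnormalised betweenness scores {node: value}.
    """
    score = dict.fromkeys(adj, 0.0)
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma)
        # and recording each node's predecessors on those paths.
        order = []
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)
        dist = dict.fromkeys(adj, -1)
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back towards s.
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every undirected pair was visited from both endpoints.
    return {v: c / 2.0 for v, c in score.items()}

# Toy chain: two groups joined only through "Topic Modeling".
toy_graph = {
    "Speech Recognition and Synthesis": {"Natural Language Processing Techniques"},
    "Natural Language Processing Techniques": {"Speech Recognition and Synthesis",
                                               "Topic Modeling"},
    "Topic Modeling": {"Natural Language Processing Techniques",
                       "Generative Adversarial Networks and Image Synthesis"},
    "Generative Adversarial Networks and Image Synthesis": {"Topic Modeling",
                                                            "Advanced Vision and Imaging"},
    "Advanced Vision and Imaging": {"Generative Adversarial Networks and Image Synthesis"},
}
bridge_scores = betweenness_centrality(toy_graph)
print(max(bridge_scores, key=bridge_scores.get))
```

In the toy chain the middle node scores highest because every path between the two sides must cross it, which is the same property that marks bridge topics in the real graph.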

​

The resulting structure is not uniform: it features a dominant connected component encompassing most mainstream topics, along with smaller, more peripheral groups. These outliers may signal either niche domains or emerging areas still integrating with the broader research ecosystem.

​

Altogether, the co-occurrence graph offers a structural and semantic snapshot of the field's current organization. It illustrates how certain domains, especially NLP, have consolidated around shared methods and applications, while others remain more specialized or cross-disciplinary. As a tool, it serves both retrospective analysis and forward-looking exploration, helping uncover bridges, blind spots and opportunities for novel intersections.

​

Temporal Embedding
umap_animation.gif

To understand the evolution of research in artificial intelligence and machine learning over the years, I combined graph-based structural embeddings with text-based embeddings of paper titles and abstracts. Graph embeddings were computed monthly: papers were linked via bibliographic coupling, and the Node2Vec algorithm generated 64-dimensional structural embeddings capturing local citation-based proximity. In parallel, text embeddings were created with the SBERT (Sentence-BERT) model, encoding each paper's title and abstract into a 768-dimensional semantic vector. These were reduced to 64 dimensions via PCA to match the structural embeddings and then concatenated into a unified 128-dimensional embedding. This fusion yields a simultaneous structural and semantic representation while preserving the temporal aspect.
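Two pieces of this pipeline are easy to sketch: the bibliographic coupling step (two papers are linked when they share references) and the fusion step (PCA to 64 dimensions, then concatenation). The code below is an illustration only. Random arrays stand in for the Node2Vec and SBERT outputs, PCA is done with a plain NumPy SVD rather than whatever library the project used, and all function names are hypothetical.

```python
from itertools import combinations

import numpy as np

def bibliographic_coupling(references):
    """references: {paper: set of cited ids}.

    Returns {(a, b): number of shared references} for pairs sharing at least one.
    """
    coupling = {}
    for a, b in combinations(sorted(references), 2):
        shared = len(references[a] & references[b])
        if shared:
            coupling[(a, b)] = shared
    return coupling

def pca_reduce(x, n_components):
    """Project the rows of x onto the top principal components (SVD-based PCA)."""
    centered = x - x.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def joint_embedding(graph_emb, text_emb, dim=64):
    """Reduce text embeddings to `dim` dimensions and append them to graph embeddings."""
    return np.concatenate([graph_emb, pca_reduce(text_emb, dim)], axis=1)

# Coupling on toy reference lists: p1 and p2 share two references.
toy_refs = {"p1": {"r1", "r2", "r3"}, "p2": {"r2", "r3"}, "p3": {"r9"}}
coupling = bibliographic_coupling(toy_refs)

# Placeholders for one month of data: 200 papers.
rng = np.random.default_rng(0)
graph_emb = rng.normal(size=(200, 64))   # stand-in for Node2Vec vectors
text_emb = rng.normal(size=(200, 768))   # stand-in for SBERT vectors
joint = joint_embedding(graph_emb, text_emb)
print(coupling, joint.shape)
```

One design point worth checking in practice is the relative scale of the two halves: if one block has much larger norms, it will dominate distance-based steps downstream, so rescaling each block before concatenation may be worthwhile.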

​

To identify topic shifts and evolution over time, I applied HDBSCAN. Clustering was done independently for each month, allowing the detection of changes and divergences over time. To visualize how the clusters evolved, I used UMAP, which reduces the joint embeddings to two dimensions while preserving the local structure of the data. This produced the animated plot showing how clusters appear, grow, merge and disappear over the months. To better understand each cluster, I ran TF-IDF analysis over the abstracts, extracting the most representative terms for each group. This methodology provided a clear picture of how research areas have developed, diverged and reconnected across time.
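The term-extraction step can be sketched as below. This is one plausible variant, not necessarily the one used in the project: it treats all abstracts in a cluster as a single document, uses a smoothed IDF, and does no stop-word removal. The function name, the toy abstracts and the cluster labels are made up for illustration.

```python
import math
import re
from collections import Counter

def top_terms_per_cluster(abstracts, labels, k=3):
    """Return {cluster label: the k highest-scoring TF-IDF terms}."""
    counts = {}
    for text, label in zip(abstracts, labels):
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        counts.setdefault(label, Counter()).update(tokens)
    n_clusters = len(counts)
    # Document frequency: in how many clusters does each term occur?
    df = Counter(term for c in counts.values() for term in c)
    top = {}
    for label, c in counts.items():
        total = sum(c.values())
        scores = {
            term: (n / total) * (math.log((1 + n_clusters) / (1 + df[term])) + 1.0)
            for term, n in c.items()
        }
        # Highest score first; ties are broken alphabetically for stable output.
        top[label] = sorted(scores, key=lambda t: (-scores[t], t))[:k]
    return top

toy_abstracts = [
    "graph neural networks for node classification",
    "gnns on graphs node embeddings",
    "policy gradient regret bounds for rl",
    "offline rl policy learning",
]
toy_labels = [0, 0, 1, 1]
top_terms = top_terms_per_cluster(toy_abstracts, toy_labels)
print(top_terms)
```

Terms shared by every cluster (here "for") receive the lowest IDF and fall down the ranking, which is what lets cluster-specific vocabulary surface.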

​

Over the years, the evolution of AI research revealed a dynamic interplay between core techniques (neural networks, graph models, reinforcement learning and transformers) and emerging applications. Early years focused heavily on deep learning and optimization, gradually expanding into graph neural networks (GNNs), natural language processing (NLP) and later into foundation models and large language models (LLMs). From 2019 onwards, the number and complexity of research directions increased notably, with new clusters appearing more frequently. Key shifts included the rise of transformer-based models, the spread of graph-based methods and the growing influence of diffusion models and LLMs like GPT-3. By 2022-2024 thematic diversity reached its peak, encompassing not just model architectures but also areas like MLOps, multimodal learning and interpretability, reflecting a more mature and application-driven landscape.

​

2019: Divergence of Architectures

The first major thematic divergence emerged in May 2019, as graph neural networks (GNNs) began to crystallize into a coherent subdomain, separating from more general neural network and training-focused papers. Cluster 1 in this month featured strong GNN-specific terms like gnns, graphs, and classification, indicating the community’s rising interest in applying graph structures to supervised learning tasks.

​

In the months that followed, particularly September through November, natural language processing (NLP) began to dominate with clusters centered on language, translation, and NMT (neural machine translation). This aligns with the rising attention to transformer models, fine-tuning, and pretraining strategies, signaled by the presence of BERT, GPT-2, and early translation models. Simultaneously, a separate stream of vision and deep learning research remained active, reinforcing the field's architectural diversification.

​

2020: Graph ML and COVID-19 Influence

By mid-2020, graph learning became further entrenched as a primary cluster theme, often appearing alongside reinforcement learning (RL). In June, GNNs were clearly distinct from neural network generalists and policy gradient learners, with clusters including gnns, node, and graphs. This signals the rise of GraphSAGE and attention-based graph models like GAT.

 

September through December witnessed an influx of biomedical NLP and COVID-19–related research. Language model clusters included dialogue, trained, and pre (indicating pretraining), while a separate stream centered on rl, state, and regret, reflecting reinforcement learning applications to decision-making in dynamic environments. These trends point to AI’s deployment in biomedical settings, just-in-time modeling, and information retrieval under pandemic constraints.

​

2021: The Rise of Diffusion and CLIP

The first half of 2021 marked an inflection point: new model families such as diffusion models (DDPMs) and multimodal transformers (e.g., CLIP) emerged, evidenced by thematic divergence in clusters involving rl, gnns, and language translation. The March and April clusters emphasize reinforcement learning, graph-based modeling, and optimization strategies.

 

In the second half of the year, the influence of Vision Transformers (ViTs) and SimCLR-style contrastive learning became apparent. September through December featured clusters distinguishing speech and graph applications from policy and RL research. The rise of papers on learning representations across modalities further broadened the methodological base.

​

2022: Foundational Models and Modal Expansion

The early months of 2022 continued the diversification, with speech and vision remaining active, and GNNs frequently appearing as a standalone cluster. Interestingly, March and April showed strong representation of neural networks and deep learning, suggesting renewed interest in optimization and model scaling.

 

By May and October, the thematic complexity reached a peak. May brought LoRA and parameter-efficient fine-tuning into the conversation, while October's four clusters captured an ecosystem of diffusion, RLHF, graph + LLM hybrids, and MLOps. October's standout terms, reasoning, rl, policy and instruct, suggest this was a pivotal moment preceding the public release of ChatGPT, with researchers already exploring instruction tuning and offline RL in tandem.

​

2023: The GPT-4 Epoch

March 2023 signaled GPT-4’s entry into the literature, and clusters including chatgpt, gpt, and llms confirm its swift impact. From May through July, the dominance of LLMs continued, now split between open-source efforts (e.g., LLaMA) and evaluation papers emphasizing human, performance and tasks.

​

October 2023 showed unprecedented diversity, with four clusters spanning GNNs, RLHF, multimodal learning, and retrieval-augmented generation (RAG). Keywords like reasoning, offline, and regret illustrate the growing complexity of LLM deployment and optimization, including emerging agent-style architectures.

​

2024: From Retrieval to Federated Learning

In early 2024, cluster themes reflect RAG techniques and time-series forecasting, the latter appearing in terms such as series, forecasting and temporal. This duality suggests a shift toward applied AI, with LLMs being integrated into production systems requiring semantic search and temporal modeling.

 

Later months, particularly May to November, demonstrate how enterprise-grade LLM deployments began shaping the literature. Frequent mentions of fl (federated learning), clients, and training highlight a growing focus on distributed optimization, privacy, and edge inference. This trajectory reveals that AI research is now balancing model quality with engineering constraints.

​​

​

Conclusion

​

Graphs are a powerful tool for modeling, analyzing and exploring complex systems. This project demonstrates that potential by representing papers, citations, topics and concepts as an interconnected temporal graph structure. By organizing the arXiv dataset into a rich graph spanning 2017 to 2024, I was able to create a foundation that supports multiple layers of analysis. The temporal nature of the graph allowed not only static insights but also dynamic views into how research areas emerged, connected and evolved over time.

​

Throughout the project, I applied a variety of graph algorithms and data science techniques to extract structure from complexity. These included time-weighted PageRank to surface influential works, co-occurrence networks to reveal topic groupings and graph-based embeddings to model structural proximity between papers. I combined these with semantic embeddings from SBERT, applying dimensionality reduction (PCA, UMAP), clustering (HDBSCAN) and interpretability tools like TF-IDF and community detection. Together, these methods illustrate how graph-based approaches enable multidimensional exploration of scientific data, highlighting connections, transitions and patterns that would be difficult to uncover using traditional pipelines.

cluster_2019_05.png
cluster_2020_06.png
cluster_2021_04.png
cluster_2022_10.png
cluster_2023_10.png
cluster_2024_05.png