Rishi Nopany

January 19th, 2025

On-Disk Graph Database Extension

An on-disk graph database extension for BuzzDB. Created as an extra-credit project for CS 6422 at Georgia Tech.

Inception

This is an extra-credit project that I made for a graduate class I took at Georgia Tech (CS 6422). The GitHub repository that has this project can be found here.

The addition of a graph database extension to BuzzDB is driven by the need to be able to efficiently store and retrieve complex, connected data. This type of database is especially essential for applications such as social media networks, recommendation systems, and graph-centric domains.

Initially, I had implemented an adjacency list to represent the graph in the form of an unordered_map. An adjacency list is used for graph storage due to its capability to allow for rapid traversals and direct relationship lookups, which is more effective compared to a traditional relational database that stores tuples in different tables. However, after further research, I realized that for the purpose of my implementation, which was mainly to be used by a social media network of some kind, it would be better to represent the graph in the form of an adjacency matrix. So, I decided to implement an adjacency matrix instead. Furthermore, I decided to change one of my 100% goals to make the entire graph on-disk. I realized during implementation that a dense graph network, consisting of many nodes and edges, will be difficult to store in memory, and might result in the user running out of memory.

As a result, I decided to create an on-disk graph. In this graph, each node and each edge is stored in one individual page. This makes it a little more inefficient to fetch the node data and the edge data from disk, but it allows us to have a graph with lots of information stored inside it, since we don’t have to worry about running out of memory.

Design

The image above illustrates a social media network represented by a graph database that uses a property graph model, which is what I have been able to implement during this project. In this graph model, we can store up to 10 different properties in each node and each edge (due to limitations in the page size), and we are able to store up to 180 nodes (again, due to limitations with the page size). As you can see, each node (post and user) and each edge has an associated page_id that is used to fetch the information for that node or edge from the disk. In addition, there is functionality to store property values in the form of integers, floats, and strings. I have also added functionality to obviously create a node and edge and flush it to disk, as well as several operations such as adding properties and removing properties, with appropriate error handling. All of the data in the graph can be stored and retrieved using helper functions that have been defined and documented in the GitHub repository. In addition, I added functionality to allow the edges in the graph to be directed or undirected. Directed edges can be useful when outlining when a node is, for example, owned by another node, such as a user owning a post on social media. Undirected edges can be used when representing two-way connections, such as when two user nodes are “friends” or “colleagues”. There are also many print methods for each class that make it easy to debug implementations if required in the future.

Correctness and Evaluation

In order to evaluate and test the functionality of the graph, I wrote several unit tests that test different capabilities of the graph, such as adding a node to the graph, adding an edge to the graph, adding properties to nodes, and adding edges to nodes.

In order to further test the robustness of my implementation, I created a sample dataset of .csv files that had data on a social media network. It models relationships between users and their associated activities (e.g., posts, likes). Each node represents either a user or a post, and each edge represents a relationship or interaction between these entities. This network allows for the representation of complex social dynamics, such as finding nth-degree connections or aggregating the likes of a user’s friends and colleagues. I first wrote a function called populateGraph that manipulated the .csv file data to create a graph out of the data, and then I created two custom functions that each serve a different purpose, but in general, outline the several capabilities of the graph database system. These two functions are:

Finding nth-Degree Connections: This function identifies all nodes within a specified degree of connection to a given node (e.g., all friends of friends). It uses Breadth-First Search (BFS) to traverse the graph up to the specified degree. This is particularly useful in social networks for exploring indirect connections, such as mutual friends or extended professional networks. Finding Connections and Likes: This function identifies all the direct colleagues and friends of a given user and calculates the total number of likes received by each. It traverses the adjacency matrix to find the user’s neighbors and evaluates their connected posts for likes.

When running the main function, one can run all the unit tests and see that they all pass successfully. Additionally, one can also run the two custom functions that have been created, with the user and degree of their choosing.

Also, they can check the runtime of the functions to see that the performance is surprisingly good. After running the functions over different nodes in the graph database, I found that the average runtime of the Finding nth-Degree Connections was 0.000204667 seconds, and the average runtime of the Finding Connections and Likes was 0.000223388 seconds. Even though this is testing on a relatively small dataset, I was surprised with these results. These experimental results further accentuate the correctness of the implementation, as well as its robustness.

Future Tasks and Improvements

During the course of this project, there are a few things that I wasn’t able to completely implement. Firstly, the graph database that I created is not persistent, which means that if the graph_manager is destroyed and recreated, the graph will need to be recreated. This is an advantage that on-disk implementations have and should definitely be implemented in the future. The main issue with this is that the adjacency matrix takes up more memory than that which is provided in a page, so it is hard to store all the metadata in one page.

Secondly, I was not able to implement the query language support. The queries that I was able to write were written using C++ and not generalized, so that is something that can be improved. For example, being able to add graph query language support, similar to Neo4j, which can allow for DFS and BFS queries is very useful. During our lectures, we learned the implementation for operators in relational databases, and I believe a lot of the same concepts can be extended to graph database query parsing as well.

Furthermore, I was not able to test the implementation of the graph database in comparison to a traditional relational database to compare how much better it does when we are storing dense social network data. This can explicitly showcase the power of graph databases in the real world and why it is chosen over traditional relational databases in many contexts, such as social media, recommendation systems, and knowledge graphs.