Advancing Bipolar Disorder research through visualizations: A Case Study

Regardless of format, domain of study, or size of dataset, it is standard practice to visualize a new dataset upon obtaining it. Visualization helps us understand the nature of the data (e.g., its distribution) and reveals whether there are missing values. These checks help us understand what questions the dataset can answer, especially when it is large and has many variables. When small datasets are later integrated into a bigger one, the integrated dataset can also be repurposed to ask new questions. However, data harmonization and integration can be challenging when the constituent datasets do not share the same set of variables or do not use the same survey instrument to measure the same construct of interest (e.g., executive functioning).

Larger datasets are needed for older-age bipolar disorder (OABD) research to improve prevention and treatment efforts. Prior research has often relied on small samples, which might have contributed to inconclusive results. The Global Aging & Geriatric Experiments in Bipolar Disorder (GAGE-BD) study is an initiative that aims to address this issue by integrating data from various research centers and institutions worldwide. By reviewing 53 studies and analyzing data from 19 cohorts with over 1400 individuals, researchers developed a comprehensive data classification system for OABD. This system helped identify key variables across 15 critical domains, including domains such as comorbidity, mood cognition, physical health, and symptom severity. This work not only advances our understanding of OABD specifically but also benefits international research consortia studying mood disorders.

The continuous maintenance and upkeep required to integrate and harmonize smaller datasets into a large database is challenging for various reasons. A key challenge is to make the database accessible to general users, who are not involved in the day-to-day upkeep of the database. In this case, the general users may be researchers and clinicians who wish to perform further analyses on the dataset post-hoc, and need a way to explore it. This is where ontology visualizations that enable exploration could be useful for general users who want to know how their existing datasets compare to the complete ontology of categories on OABD.

In this case study, I will walk through the code chunk-by-chunk to illustrate how to build an ontology visualization in R, using multiple libraries such as data.tree, igraph and networkD3.

First, we load the csv file that contains the different levels of domains and sub-domains in this dataset. In this OABD-specific ontology, there is a total of 20 main domains (or level-1 domains), with further sub-domains below each domain. The goal of this ontology visualization is to create visualizations displayed on separate webpages for each main domain (e.g., a Clinical characteristics page).

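# Packages used throughout this walkthrough
library(dplyr); library(stringr); library(tidyr)         # data wrangling
library(data.tree); library(igraph); library(networkD3)  # tree, graph, and D3 network objects
library(webshot)                                         # static snapshots of the HTML output

varicat_ontology <- read.csv("varicat_ontology_full_names.csv", sep=",", header=TRUE)
# reverse the row order (only relevant if making a dendrogram)
rev_df <- apply(varicat_ontology, 2, rev)

# convert the result back to a data frame and turn empty strings into NA
rev_df <- as.data.frame(rev_df) %>% mutate_all(., list(~na_if(.,"")))
# trim leading and trailing whitespace, then sort by main domain
rev_df <- as.data.frame(apply(rev_df,2, str_trim)) %>% arrange(level1)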

# Modify 'rev_df' by concatenating 'level1' and 'level2' if 'level2' is "Other".
# Concatenate 'level2' and 'level3' if 'level3' is "Other".
# Add "Domain" to 'level1' if 'level1' is equal to 'level2'.
# Replace '/' with '_' in 'level3', 'level2', and 'level1'.
# Replace spaces with '_' in 'level3', 'level2', and 'level1'.

# This code makes sure that if the category 'Other' exists in different main domains, e.g.
# if there is an 'Other' in both Clinical Characteristics and Metadata, each one is renamed as xxx_Other so that every 'Other' is unique

rev_df <- rev_df %>% dplyr::mutate(
  level2 = (ifelse(level2 == "Other", paste(level1, level2), level2)),
  level3 = (ifelse(level3 == "Other", paste(level2, level3), level3)),
  level1 = (ifelse(level1 == level2, paste(level1, "Domain"), level1))
) %>% mutate(
  level3 = str_replace_all(level3, "/", "_"),
  level2 = str_replace_all(level2, "/", "_"),
  level1 = str_replace_all(level1, "/", "_"),
  level3 = str_replace_all(level3, " ", "_"),
  level2 = str_replace_all(level2, " ", "_"),
  level1 = str_replace_all(level1, " ", "_")
)

The code above shows preliminary data cleaning to ensure that the values in the sub-domains are unique to the main domain. We cannot have duplicate vertex IDs when creating these network visualizations with the R igraph library, because each node must be referenced by a unique name. Hence, we need to specify unique Other categories (e.g., Metadata_Other, Demographics_Other) when they belong to different domains. The same Other label cannot appear under multiple main domains.
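To see why the renaming matters, here is a minimal base-R sketch on a made-up two-domain table. The toy rows are illustrative and not taken from the real ontology file, and the sketch joins the labels with "_" directly rather than pasting with a space and replacing it afterwards, as the cleaning code does.

```r
# Toy ontology rows (hypothetical values, for illustration only)
toy <- data.frame(
  level1 = c("Metadata", "Metadata", "Demographics", "Demographics"),
  level2 = c("Site", "Other", "Age", "Other"),
  stringsAsFactors = FALSE
)

# Before renaming: "Other" appears under two different main domains
dup_before <- unique(toy$level2[duplicated(toy$level2)])
print(dup_before)

# Same idea as the cleaning step: prefix "Other" with its parent domain
toy$level2 <- ifelse(toy$level2 == "Other",
                     paste(toy$level1, toy$level2, sep = "_"),
                     toy$level2)

# After renaming: no level-2 label is repeated
dup_after <- unique(toy$level2[duplicated(toy$level2)])
print(toy$level2)
```

A check like this can be run on the real data frame before building the graph to confirm that no label is shared across domains.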

After data cleaning, we need to prepare conversion of the data frame into a data.tree structure. We start by defining a pathString, as a new column in the data frame. The pathString describes the hierarchy by defining a path from the main domain (root) to each sub-domain (leaf).

The paste5() function used below extends the built-in paste() function with an option to remove NAs from the input vectors or columns before concatenation. (This function’s code was found here). We want to drop the NAs from the data frame and concatenate the names of the domains and sub-domains. Subsequently, we group the data frame by the main domain (level 1) and split it into many smaller data frames, so that each one consists of only one main domain and its sub-domains.
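Since the definition of paste5() is not reproduced in this post, here is a minimal sketch of an NA-skipping concatenation helper. The name paste_skip_na and the toy rows are hypothetical, and this is not the original paste5() code; it only illustrates how a pathString can be assembled when the deeper levels are missing.

```r
# Hypothetical helper (not the original paste5): join columns row by row, skipping NAs
paste_skip_na <- function(..., sep = "/") {
  cols <- data.frame(..., stringsAsFactors = FALSE)
  apply(cols, 1, function(row) paste(row[!is.na(row)], collapse = sep))
}

# Toy rows (illustrative values): a main domain, a sub-domain, and a sub-sub-domain
toy <- data.frame(
  level0 = c("Variable", "Variable", "Variable"),
  level1 = c("Metadata", "Metadata", "Metadata"),
  level2 = c(NA, "Site", "Site"),
  level3 = c(NA, NA, "Country"),
  stringsAsFactors = FALSE
)

# Each path stops at the deepest non-missing level
toy$pathString <- paste_skip_na(toy$level0, toy$level1, toy$level2, toy$level3)
print(toy$pathString)
```

Without the NA handling, plain paste() would insert the literal text "NA" into the path, which would create spurious nodes in the tree.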

# create pathString to make the data tree
rev_df$pathString <- paste5(rev_df$level0,
 rev_df$level1,
 rev_df$level2,
 rev_df$level3,
 sep = "/", na.rm=TRUE
 )

grp <- rev_df %>% group_by(level1)
ls_dfs <- group_split(grp)

##                                                    pathString   my_id
## 1                                    Clinical Characteristics   NA
## 2                                           Current Diagnosis   1
## 3    Clinical Characteristics/Current Diagnosis/Rapid-Cycling   2

Now we’re ready to convert it to a data.tree object and eventually create a networkD3 object, which can be visualized later. We do so by defining a function, create_fn, which accomplishes a few goals: converting the data frame to a data.tree object with as.Node(), converting that to an igraph object with as.igraph(), and then to a networkD3 object with igraph_to_networkD3(). There are several data manipulation steps required in between to get the data into the form required by these functions.

create_fn<- function(rev_df){
# Create a data.tree object 'ontology' from the 'rev_df' data frame, removing rows with NAs.
ontology <- as.Node(rev_df, na.rm=TRUE)

# Convert the data.tree object 'ontology' to an igraph object 'g' with edges pointing upwards (climbing direction).
# Delete the vertex labelled "Variable" from the graph.
# We delete it because the root/main node should be the main domain.
g <- as.igraph(ontology, direction = "climb") %>% delete_vertices("Variable")

# Sort 'rev_df' by 'level1' and 'level2', and add a new column 'l1_id' to factorize 'level1'.
code_temp <- rev_df %>%
  arrange(level1, level2) %>%
  mutate(l1_id = factor(level1)) # we use the 'l1_id' for later

# Stack the level columns on top of each other, each paired with 'l1_id'.
# Columns are selected by position: 2, 3 and 4 are assumed to be 'level1', 'level2' and 'level3', and 6 is 'l1_id'.
code_temp <- bind_rows(code_temp[c(2, 6)], code_temp[c(3, 6)], code_temp[c(4, 6)])

# Create 'membership_custom' by combining 'code_temp' column 'l1_id', which is the main domain for that data frame
# we also collapse all the sub-domains (level 2) and sub-sub domains (level 3) into a single column
membership_custom <- cbind(code_temp[2], name = do.call(pmax, c(code_temp[-2], na.rm = TRUE))) %>% distinct() %>% drop_na() %>% arrange(name)

# Convert the igraph object 'g' to a networkD3 object 'd3' for visualization purposes.
d3 <- igraph_to_networkD3(g)

# Merge the 'd3$nodes' data frame with 'membership_custom' on the "name" column to obtain color groupings.
# The purpose of 'membership_custom' is to let us specify chosen colors for each domain.
membership_custom <- merge(d3$nodes, membership_custom, by = "name")

# Reorder the 'd3$nodes' data frame based on the original order in the data.tree object 'ontology'.
d3$nodes <- membership_custom[match(d3$nodes$name, membership_custom$name),]

# make a grouping variable that will match to colours
 d3$nodes <- d3$nodes %>%
 mutate(color_group = case_when(
    name %in% c("Clinical_characteristics",
    "Cognitive",
    "Clinical_Trial-specific",
    "Course_of_treatment",
    "Course_of_bipolar_illness-episodes",
    "Course_of_bipolar_illness-symptoms",
    "Course_of_non-bipolar_psychiatric_illness",
    "Current_illness_severity",
    "Current_pharmacological_treatment",
    "Demographics",
    "Ecological_Momentary_Assessment",
    "Family_history",
    "Lifetime_pharmacological_treatment",
    "Metadata",
    "Trauma_stress",
    "Physical_health",
    "Non-pharmacological_treatment",
    "Physiological",
    'Miscellaneous_Domain',
    "Positive_psychological_factors") ~ "main",
     TRUE ~ paste0(l1_id, "_sub")
  ))
   return(d3) # returns a list holding two data frames: d3$nodes and d3$links
}

The output of print(ontology) looks like this:

Variable
°--Clinical_characteristics
¦--Current_inpatient_vs_outpatient_status
°--Current_diagnosis
¦--Other_diagnostic_specifiers
¦--Current_non-bipolar_diagnosis_psychiatric_comorbidity
¦--Current_substance_or_alcohol_abuse_dependence_diagnosis
¦--First_episode?
¦--Most_recent_affective_episode_type
¦--Current_episode_type_(e.g._manic_depressed_euthymic_mixed_remitted)

And the output of print(membership_custom) looks like this. membership_custom and color_group together help specify the different colors of the nodes for each main domain.

l1_id                    name
<fctr>                   <chr>
Clinical_characteristics BD_diagnosis_description
Clinical_characteristics BD_diagnostic_code
Clinical_characteristics Bipolar_subtype_(Bipolar_1_or_2_or_NOS)
Clinical_characteristics Clinical_characteristics
Clinical_characteristics Current_diagnosis
Clinical_characteristics Current_episode_type_(e.g._manic_depressed_euthymic_mixed_remitted)
Clinical_characteristics Current_inpatient_vs_outpatient_status
Clinical_characteristics Current_non-bipolar_diagnosis_psychiatric_comorbidity
Clinical_characteristics Current_substance_or_alcohol_abuse_dependence_diagnosis
Clinical_characteristics Diagnostic_group_(e.g._Bipolar_vs_other_psychiatric_diagnosis_or_HC
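The name column above comes from the do.call(pmax, ...) step inside create_fn(), which is easy to misread. The following base-R sketch, using a hypothetical three-row stand-in for the stacked code_temp, shows the idea: when each row holds a label in exactly one level column, an element-wise pmax() with na.rm = TRUE returns that label.

```r
# Toy stand-in for the stacked 'code_temp' (hypothetical values):
# each row carries a label in exactly one of the level columns
stacked <- data.frame(
  level1 = c("Metadata", NA, NA),
  level2 = c(NA, "Site", NA),
  level3 = c(NA, NA, "Country"),
  stringsAsFactors = FALSE
)

# pmax() compares element-wise across the columns; with na.rm = TRUE it skips NAs,
# so the single non-missing label in each row is returned
node_name <- do.call(pmax, c(stacked, na.rm = TRUE))
print(node_name)
```

Note that this shortcut relies on there being only one non-missing value per row; with several, pmax() would return the alphabetically largest one.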

The function create_fn() produces a d3 object, and we call it within another function defined below, visualizeDomainNetwork(). In this function we accomplish the final few steps needed to create network visualizations using the networkD3 package. More specifically, the forceNetwork() function defines the attributes of the visualization (see code comments for details).

Additionally, the my_color variable holds JavaScript code that is passed to the colourScale argument of forceNetwork(). When evaluated in the appropriate JavaScript context, it maps the groups listed in .domain() to the corresponding colors specified in .range(). When the network visualization is rendered, each category (e.g., “Clinical_characteristics_sub”) will be consistently associated with the same color (e.g., “#7FC07F”). This gives each group a consistent and visually distinguishable color. The colors in the range are assigned to the categories in the domain in the order in which they are specified.
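Because the domain and range are matched purely by position, it can be safer to build the JavaScript string from two R vectors than to type it by hand. The sketch below is one way to do that in base R; the three groups are a hypothetical subset, and js_array and my_color_toy are illustrative names, not part of the original code.

```r
# Hypothetical subset of groups and colors (the real list has one entry per main domain)
groups <- c("Clinical_characteristics_sub", "Cognitive_sub", "main")
colors <- c("#7FC07F", "#BEAED4", "black")
stopifnot(length(groups) == length(colors))  # pairs are matched by position

# Quote each element and join everything into a JavaScript array literal
js_array <- function(x) paste0("[", paste0('"', x, '"', collapse = ", "), "]")

# Assemble the same kind of string that is stored in 'my_color'
my_color_toy <- paste0(
  "d3.scaleOrdinal().domain(", js_array(groups), ").range(", js_array(colors), ")"
)
cat(my_color_toy, "\n")
```

Keeping the groups and colors in parallel vectors also makes it straightforward to check that no group is left without a color.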

Finally, we define a clickjs variable that stores JS code. The code specifies that a node should increase in size and its label should appear when the mouse hovers over it or when it is clicked on. When a node is clicked on, the label remains until the user clicks on it again. Otherwise, when a user hovers over it and moves the mouse away, the label disappears.

visualizeDomainNetwork <- function(df) {

my_color <- 'd3.scaleOrdinal()
  .domain([
   "Clinical_characteristics_sub", "Cognitive_sub",  "Course_of_bipolar_illness-episodes_sub", "Course_of_bipolar_illness-symptoms_sub",
   "Course_of_non-bipolar_psychiatric_illness_sub", "Current_illness_severity_sub"   , "Current_pharmacological_treatment_sub", "Demographics_sub", "Ecological_Momentary_Assessment_sub"  ,
   "Family_history_sub"  , "Lifetime_pharmacological_treatment_sub", "Metadata_sub"  ,  "Trauma_stress_sub"  ,  "Physical_health_sub", "Physiological_sub" ,
   "Positive_psychological_factors_sub", "Clinical_Trial-specific_sub", "Course_of_treatment_sub",     "Non-pharmacological_treatment_sub", "Miscellaneous_Domain_sub",
   "main"
  ])
  .range([
    "#7FC07F",  "#BEAED4","darkgreen","#C90B20","violet",
    "#F0027F","#BF5B20","#666666","#1B0E77","#D95F00",
    "#7570B3","#E7290A","#66A61E","#E6AA00","#A6761D",
    "#fff733","#A6CED3","darkmagenta","#FB9A99","turquoise",
    "black"
  ])'  # the color black is specified for the main domain (level 1) nodes

viz_name <- unique(df$level1) # get the main domain name  (e.g., clinical characteristics) and use it to save as the file name later
d3 <- create_fn(df) # d3 object produces two dfs, d3$nodes and d3$links

# Create a network visualization using the 'forceNetwork' function from the 'networkD3' package.
# The 'Links' argument specifies the link data (edges) and 'Nodes' argument specifies the node data (vertices).
# Node and link attributes are provided using 'Source', 'Target', 'NodeID', and 'Group' arguments.
# The color of the nodes is determined by the 'color_group' attribute, which is specified based on the 'level1' (main domain) values.

fn <- forceNetwork(
  Links = d3$links,                  # Input data frame for links (edges) between nodes.
  Nodes = d3$nodes,                  # Input data frame for nodes (vertices) of the network.
  Source = 'source',                 # Column name in 'Links' representing the source nodes of the links.
  Target = 'target',                 # Column name in 'Links' representing the target nodes of the links.
  NodeID = 'name',                   # Column name in 'Nodes' representing the unique ID of each node.
  Group = 'color_group',             # Column name in 'Nodes' representing the groups (categories) of nodes.

  opacity = 1,                       # Opacity of nodes and links (values between 0 and 1).
  zoom = TRUE,                       # Enable zooming functionality in the visualization.
  linkDistance = 0.0001,             # Desired link distance (higher values lead to more spread-out layout).
  radiusCalculation = 0.01,          # Node-radius expression; networkD3 applies it only when 'Nodesize' is supplied.

  charge = -70,                      # Strength of node repulsion (negative value) or attraction (positive value).
  fontSize = 20,                     # Font size for node labels (e.g., the names of nodes).
  fontFamily = "Calibri",            # Font family for the node labels.

  colourScale = my_color             # The color scale function used to assign colors to nodes based on their groups.
)


### === specify the path of the folder where you want to save the output jpeg and html ===
filename = paste0("full_ontology/",viz_name,".html")
jpegname = paste0( "full_ontology/", viz_name,".jpeg")

# Define a custom JavaScript function 'clickjs' to handle node click events in the visualization.
# When a node is clicked, the label will remain until the user clicks it again
# When the user hovers over the node, the label will appear but will disappear when the mouse moves away
clickjs <- "function(el, x) {
  var options = x.options; // Store the options passed from R

  // Select the SVG element
  var svg = d3.select(el).select('svg');

  // Select all nodes and links in the SVG
  var node = svg.selectAll('.node');
  var link = svg.selectAll('.link');

  // Store the mouseout event listener for nodes
  var mouseout = d3.selectAll('.node').on('mouseout');

  // Function to calculate the node size based on options
  function nodeSize(d) {
    if (options.nodesize) {
      return eval(options.radiusCalculation);
    } else {
      return 6;
    }
  }

  // Add click event listener to all nodes
  d3.selectAll('.node').on('click', onclick);

  // Function to handle node click events
  function onclick(d) {
    if (d3.select(this).on('mouseout') == mouseout) {
      // If node is not clicked, assign mouseout_clicked event listener
      d3.select(this).on('mouseout', mouseout_clicked);
    } else {
      // If node is clicked, assign regular mouseout event listener
      d3.select(this).on('mouseout', mouseout);
    }
  }

  // Function to handle mouseout event for clicked nodes
  function mouseout_clicked(d) {
    // Reset opacity of nodes and links
    node.style('opacity', +options.opacity);
    link.style('opacity', +options.opacity);

    // Transition the node circle to its original size
    d3.select(this).select('circle').transition()
      .duration(750)
      .attr('r', function(d) { return nodeSize(d); });

    // Transition the node text to its original position and font size
    d3.select(this).select('text').transition()
      .duration(1250)
      .attr('x', 0)
      .style('font', options.fontSize + 'px ');
  }
}"

# Render the network visualization with the custom 'clickjs' function applied to handle node click events.
# The visualization is then saved as an HTML file with the appropriate filename in the "full_ontology" folder.
htmlwidgets::onRender(fn, clickjs) %>%  saveNetwork(filename)

# Additionally, a snapshot of the visualization is taken and saved as a JPEG image with the same filename in the folder.
webshot(filename, jpegname, vwidth = 400, vheight = 400, zoom = 0.8)
}

num_dfs <- length(ls_dfs) - 1  # all groups except the last one
# run the function on each of the data frames containing the variables for each main domain
for (i in seq_len(num_dfs)) {
  visualizeDomainNetwork(ls_dfs[[i]])
}

We then save the generated ontology visualizations in jpeg and html format. The jpegs are displayed on the main page of the website, and each of the main domains can be further explored interactively as a network visualization on its own webpage (essentially the html files we generated above).

The complete code can be found on this GitHub repository. This blog post details the process of creating the network visualizations for the full ontology. If you are a researcher who would like to visualize your own OABD dataset and see how it compares to the full ontology, we have also provided sample code for doing so (with several limitations, however). Our website also shows example visualizations of other datasets (e.g., the Inflammaging dataset collected at the BRAIN Lab at UCSD, and the integrated dataset, which is a joint effort led by the GAGE-BD consortium). To use the sample code, your dataset must be formatted in specific ways so that the data cleaning works on it. The sample code was written for the Inflammaging and integrated datasets, both of which were downloaded from the REDCap database and therefore share the same formatting (e.g., column names) in the downloaded csv file.

Ontology visualizations have the potential not only to help researchers with their data integration and harmonization efforts, but also to show the completeness of a study design before data collection. During the research design phase, researchers could upload the variables they intend to collect, create the network visualization, and compare it to the full ontology, just as we are able to compare the integrated and Inflammaging datasets with the complete ontology visualization now. This is another example of how visualization tools can help researchers assess their current research design, either to spot domains they have not considered or to check whether several of their instruments collect data on variables that belong to the same domain, hence preventing duplicate effort.
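The same comparison can also be made in tabular form before drawing anything. The base-R sketch below uses a made-up ontology and a made-up list of planned variables (all names and values are hypothetical) to count, per main domain, how many sub-domains a planned study would cover, and to list the domains it would miss entirely.

```r
# Hypothetical full ontology: sub-domain -> main domain (illustrative values only)
ontology <- data.frame(
  level1 = c("Demographics", "Demographics", "Cognitive", "Physical_health"),
  level2 = c("Age", "Education", "Executive_functioning", "Body_mass_index"),
  stringsAsFactors = FALSE
)

# Hypothetical sub-domains that a planned study intends to collect
planned <- c("Age", "Executive_functioning")

# Flag the covered sub-domains, then count coverage per main domain
ontology$covered <- ontology$level2 %in% planned
coverage <- aggregate(covered ~ level1, data = ontology, FUN = sum)

# Main domains with no planned variables at all
missing_domains <- as.character(coverage$level1[coverage$covered == 0])
print(missing_domains)
```

A table like this complements the network view: the visualization shows where the gaps sit in the hierarchy, while the counts make them easy to report.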

Ontologies provide a common language for sharing information. The need for one is evident not only in clinical domains, but also in areas such as agriculture and food safety. As long as data remain isolated in private databases and use custom terminology, efficient data exchange will be difficult. This is why consortium-driven efforts such as FoodOn and ORCHESTRA exist: the former enhances food traceability, making sure we have accurate and consistent information about the foods we consume, regardless of cultural differences or geographic location, and the latter supports retrospective and prospective research on COVID in order to generate rigorous evidence to improve prevention and treatment.

By creating a common vocabulary, VariCat enables data harmonization and supports data-sharing, which is crucial for discovering generalizable and more global insights into OABD.

This article was originally published on Medium.