In recent years, many researchers have adopted Virtual Reality (VR) simulations to gain meaningful insights from complex datasets, since the technology promotes immersion and interactivity. To explore the potential of VR for visualizing complex data, we are developing a prototype for immersive, three-dimensional scatter plot visualization using PCA and t-SNE. The prototype is built with Unity3D and targets the Oculus Go, a relatively inexpensive, modern virtual reality headset available to the general public. This post is a draft covering some of the techniques and details we followed during development, along with related demo videos.

This demo video shows a PCA and t-SNE visualization of COVID-19 literature clustering. It is based on a Kaggle problem: given the large volume of literature and the rapid spread of COVID-19, it is difficult for health professionals to keep up with new information on the virus. Can clustering similar research articles together simplify the search for related publications?

By using clustering for labeling in combination with dimensionality reduction for visualization, the collection of literature can be represented in a scatter plot. In this plot, publications on highly similar topics share a label and are plotted near each other.

We can walk through the scatter plot in VR space. When we zoom in close to a data point, details such as the paper's title, abstract, authors, and journal are shown to give a gist of the paper. If a paper looks interesting, we can click the “more..” button to open the related URLs in the browser.

Dataset

We used the COVID-19 Open Research Dataset from Kaggle. This dataset contains 200,000 scholarly articles, including over 100,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease.

For this experiment, we randomly selected 10,000 articles and kept only those written in English.

[Figure: language distribution of the selected articles]

We clustered the articles to group similar ones together, and visualized the clusters to simplify the search for related publications.

Clustering

We used the body_text field of each article for clustering. As part of feature engineering, we tokenized each article's body_text and removed stop words. We then vectorized each article using TF-IDF (Term Frequency-Inverse Document Frequency) and reduced the dimensionality with PCA, keeping enough components to preserve 95% of the variance.
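
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the exact code from our prototype: the sample sentences are made-up stand-ins for body_text, and we assume scikit-learn's built-in English stop word list.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for the body_text of a few articles.
body_texts = [
    "The coronavirus spike protein binds to the ACE2 receptor on host cells.",
    "Spike protein mutations may change how the virus binds the ACE2 receptor.",
    "Hospital capacity and ventilator supply were modeled during the outbreak.",
    "Models of hospital capacity help plan ventilator supply in an outbreak.",
    "Vaccine trials measured antibody response after two doses.",
    "Antibody response after vaccine doses was measured in clinical trials.",
]

# Tokenize, drop English stop words, and compute TF-IDF weights in one step.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(body_texts)  # sparse matrix: articles x terms

# A float n_components asks PCA to keep the smallest number of components
# whose cumulative explained variance exceeds that fraction.
pca = PCA(n_components=0.95, svd_solver="full")
reduced = pca.fit_transform(tfidf.toarray())  # PCA needs a dense array here

print(tfidf.shape, "->", reduced.shape)
```

On a real corpus the dense conversion can be memory-hungry, so capping the vocabulary size (for example with the vectorizer's max_features parameter) may be worth considering.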

We used the K-Means algorithm to cluster the articles. We found the optimal “K” (number of clusters) for the dataset using the Elbow method and chose k=20.

[Figure: Elbow method plot for the optimal k]
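
For readers unfamiliar with the Elbow method, here is a small sketch of the idea using scikit-learn. The data is synthetic (generated with make_blobs as a stand-in for the PCA-reduced article vectors), and the range of k values is an arbitrary choice for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the PCA-reduced article vectors (an assumption).
features, _ = make_blobs(n_samples=300, centers=4, n_features=10, random_state=0)

k_values = list(range(2, 11))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    # inertia_ is the sum of squared distances from each point to its centroid.
    inertias.append(model.inertia_)

# The "elbow" is the k after which inertia stops dropping sharply.
for k, inertia in zip(k_values, inertias):
    print(f"k={k:2d}  inertia={inertia:.1f}")

# After reading the elbow off the curve, refit with the chosen k.
chosen_k = 4  # illustrative value for this toy data; the post chose k=20
labels = KMeans(n_clusters=chosen_k, n_init=10, random_state=0).fit_predict(features)
```

Plotting inertias against k_values gives the usual elbow curve. Picking the bend is a judgment call, so it is worth checking a few neighbouring values of k as well.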

Visualization

We used t-SNE and PCA to project the data into lower dimensions (2D and 3D) for visualization, and used the cluster labels from the K-Means algorithm to label the points.
[Figure: visualization]
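
As a rough sketch of the projection step, the snippet below reduces some feature vectors to 3D with both PCA and t-SNE and pairs each point with its K-Means label. The input is synthetic and the perplexity value is an assumption for illustration, not a tuned setting from our prototype.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Synthetic stand-in for the article feature vectors (an assumption).
features, _ = make_blobs(n_samples=150, centers=5, n_features=20, random_state=0)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(features)

# Linear projection onto the three directions of highest variance.
pca_3d = PCA(n_components=3).fit_transform(features)

# Non-linear embedding that tries to keep neighbouring points close together.
tsne_3d = TSNE(n_components=3, perplexity=30, random_state=0).fit_transform(features)

# Each row is one scatter plot point: x, y, z plus its cluster label.
for (x, y, z), label in list(zip(tsne_3d, labels))[:5]:
    print(f"{x:8.2f} {y:8.2f} {z:8.2f}  cluster={label}")
```

Rows like these (coordinates plus a label) are the kind of table a 3D scatter plot needs, whatever tool ends up rendering it.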