NGordNet

Want to Demo?

Github Project

About the Project

A browser-based tool for exploring the history of word usage in English texts. I wrote the Java backend for the tool, which accepts user input and generates the appropriate output for display.

Using the historical frequencies of all observed English ngrams [1] from Google's Ngram dataset, I programmed the backend for a clone of Google's Ngram Viewer, which lets users visualize the relative popularity of words and phrases. To limit scope, this tool handles only 1grams [2], and only a subset of the full 1gram dataset.

Data Structures

  • TimeSeries: A class that, much like a TreeMap, maps each year to that year's numerical data point. It provides years(), which returns all years in the time series as a list, and data(), which returns all data points as a list.
  • NGramMap: A class that loads the dataset and organizes it using the TimeSeries class described above. Its methods include countHistory(), which returns the yearwise count of a given word; totalCountHistory(), which returns the yearwise total count of all words; weightHistory(), which returns the yearwise relative frequency of a given word; and summedWeightHistory(), which returns the yearwise sum of the relative frequencies of a given set of words.
  • HistoryTextHandler: A class that takes an NGordNetQuery, the data type that carries the user's input from the web tool, and returns the history of the word the user typed as text, listing each year with its count.
  • HistoryHandler: A class that plots the collected data and returns the plot as a String containing a base-64 encoded image.
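
The year-to-value mapping and the relative-frequency idea above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class name TimeSeriesSketch, the dividedBy() helper, and the sample numbers are all assumptions made for the example. It also assumes that "relative frequency" means a word's count in a year divided by the total count of all words in that year.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a TimeSeries-like class; the real class may differ.
class TimeSeriesSketch extends TreeMap<Integer, Double> {

    /** Returns all years in this series, in ascending order. */
    List<Integer> years() {
        return new ArrayList<>(keySet());
    }

    /** Returns all data points, in the same order as years(). */
    List<Double> data() {
        return new ArrayList<>(values());
    }

    /**
     * Assumed helper (not named in the text above): divides each value by
     * the value for the same year in totals, giving a relative frequency.
     */
    TimeSeriesSketch dividedBy(TimeSeriesSketch totals) {
        TimeSeriesSketch result = new TimeSeriesSketch();
        for (Map.Entry<Integer, Double> entry : entrySet()) {
            Double total = totals.get(entry.getKey());
            if (total == null) {
                throw new IllegalArgumentException("No total for year " + entry.getKey());
            }
            result.put(entry.getKey(), entry.getValue() / total);
        }
        return result;
    }
}

class TimeSeriesSketchDemo {
    public static void main(String[] args) {
        // Made-up counts for one word, inserted out of order on purpose.
        TimeSeriesSketch counts = new TimeSeriesSketch();
        counts.put(2001, 50.0);
        counts.put(2000, 10.0);

        // Made-up totals for all words in the same years.
        TimeSeriesSketch totals = new TimeSeriesSketch();
        totals.put(2000, 100.0);
        totals.put(2001, 200.0);

        System.out.println(counts.years());           // [2000, 2001]
        System.out.println(counts.dividedBy(totals)); // {2000=0.1, 2001=0.25}
    }
}
```

Extending TreeMap keeps the years sorted automatically, so years() and data() always line up index by index.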

Incorporating WordNet Dataset




WordNet groups words into sets of synonyms called synsets and describes the semantic relationships between them.

Each node in the graph is a synset: a group of words that share the same meaning. A word can belong to multiple synsets and can therefore appear in multiple hyponym lists. To handle these cases in the visualizer, we need additional data structures that gather words into their proper hyponym sets.

  • HyponymsHandler: The implementation of the Hyponyms button in the visualizer. It reads several datasets and combines the results with the data produced by the earlier data structures. It builds on TimeSeries and NGramMap, which provide countHistory() and totalCountHistory() for each word. The button outputs a string representation of a list of the hyponyms of a single word, including the word itself. The list is in alphabetical order, with no repeated words. For instance, in the example graph, the output for "descent" should be [descent, jump, parachuting]. The data comes from two text files, hyponyms.txt and synsets.txt.
  • Graph: A class that connects each synset to its hyponyms and supports the output shown in the visualizer. It converts the WordNet dataset files into a graph by adding nodes and edges, then uses Depth-First Search (DFS) to traverse the graph and find all hyponyms of a given word.
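
The hyponym lookup described above can be sketched as a DFS over a small in-memory graph. This is an illustrative sketch under stated assumptions, not the project's implementation: the class name SynsetGraphSketch, its method names, and the synset IDs are hypothetical, and the graph is built by hand instead of being parsed from hyponyms.txt and synsets.txt. The three-synset example is a made-up miniature chosen to match the "descent" output quoted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: synsets are nodes, edges point from a synset to its hyponyms.
class SynsetGraphSketch {
    private final Map<Integer, List<String>> wordsById = new HashMap<>();
    private final Map<Integer, List<Integer>> hyponymIds = new HashMap<>();
    private final Map<String, List<Integer>> idsByWord = new HashMap<>();

    /** Adds a synset node; a word may appear in several synsets. */
    void addSynset(int id, String... words) {
        wordsById.put(id, Arrays.asList(words));
        for (String word : words) {
            idsByWord.computeIfAbsent(word, k -> new ArrayList<>()).add(id);
        }
    }

    /** Adds a directed edge from a synset to one of its hyponym synsets. */
    void addEdge(int fromId, int hyponymId) {
        hyponymIds.computeIfAbsent(fromId, k -> new ArrayList<>()).add(hyponymId);
    }

    /** Returns the word plus all of its hyponyms, sorted and without duplicates. */
    List<String> hyponyms(String word) {
        Set<Integer> visited = new HashSet<>();
        Set<String> result = new TreeSet<>(); // TreeSet sorts and deduplicates
        // Start a DFS from every synset that contains the word.
        for (int id : idsByWord.getOrDefault(word, Collections.<Integer>emptyList())) {
            dfs(id, visited, result);
        }
        return new ArrayList<>(result);
    }

    private void dfs(int id, Set<Integer> visited, Set<String> result) {
        if (!visited.add(id)) {
            return; // synset already explored
        }
        result.addAll(wordsById.getOrDefault(id, Collections.<String>emptyList()));
        for (int next : hyponymIds.getOrDefault(id, Collections.<Integer>emptyList())) {
            dfs(next, visited, result);
        }
    }
}

class SynsetGraphSketchDemo {
    public static void main(String[] args) {
        SynsetGraphSketch graph = new SynsetGraphSketch();
        graph.addSynset(0, "action");
        graph.addSynset(1, "descent");
        graph.addSynset(2, "jump", "parachuting");
        graph.addEdge(0, 1); // descent is a hyponym of action
        graph.addEdge(1, 2); // jump/parachuting are hyponyms of descent

        System.out.println(graph.hyponyms("descent")); // [descent, jump, parachuting]
    }
}
```

Collecting results in a TreeSet gives the alphabetical, duplicate-free list for free, and the visited set keeps a synset reachable along several paths from being explored twice.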

Definitions


1. Ngram: a contiguous sequence of n words, such as a single word or a phrase
2. 1gram: an ngram consisting of a single word