Follow the Data: Choosing the Best Java Abstract Data Structure


How Do I Organize All These Kitties???

When developing a software system, there are some basic things to consider as you determine which data structures and algorithms to implement. Start by asking a few questions about the data being used to point you in the right direction. We want to choose the best Java Abstract Data Structures (ADT) to collect, organize, and access the data.

Questions to Get Started:

  1. What is the purpose of the software? What is the application trying to accomplish (we call this the business use case)?
  2. What type of data will be interacting with the system?
  3. Does the system need to scale, or will data size stay relatively static?
  4. Does the data need to be sorted or unique?
Once those questions are answered you can define the data elements to be used and look at the most efficient way to collect and access them.

My Favorite Java Abstract Data Structures

Here's a quick and dirty guide for selecting the best Java Abstract Data Structures (ADT) for your new software application:
  • Linear Data Objects: These store data in a list, queue, or stack. Choosing the right one is determined by the way data elements will enter and exit the structure. These structures are the least efficient, at O(n), when searching because every element may be compared to find a match.
    • LinkedList or ArrayList: When you need a simple, flexible queue with frequent adds/removes.
    • Queue or PriorityQueue: For classic FIFO functionality, or when priority matters more than insertion order.
    • Stack: This java ADT is deprecated but provides stack FILO functionality, consider using LinkedList with push(), pop(), and peek() methods to provide Stack behavior.
  • Trees: These structures store data objects, or nodes, in a parent/child relationship starting with the root node at the base of the structure. Both Java ADTs below are built upon the Red-Black tree structure. Trees are quite efficient in most cases, at O(log n) because in theory each compare shrinks the search area by half.
    • TreeMap: A sorted, self-balancing binary tree where key value pairs are used to sort and search.
    • TreeSet: A sorted, self-balancing binary tree that holds only unique key data values.
  • Hash Tables: We didn't cover hashing in the first Data Structures and Algorithms class, but it's a powerful structure that vastly decreases the amount of time it takes to find and retrieve data elements, at O(1), it takes very little time and resources. A hash table calculates a hash code from the key data, which is then used as an index that points to a bucket containing similar data elements.
    • HashMap: An unordered collection of data elements with very fast retrieval. 
    • LinkedHashMap: Used when you want the data elements to stay insertion order.
    • HashSet: Unordered, linear data set that only holds unique data values.
Let's put it all together and apply it to our new Kitty application...MEOW!

Kitty Example

The Arizona Humane Society (AHS) has hired us to build an application used to process and manage new Kitties brought to the facility. Let's start by answering the initial questions to get a sense of the required data and how it will be used:
  1. What is the purpose of the software? The purpose of this system is to collect information about the Kitties brought to the shelter and then manage their information until they are adopted.
  2. What type of data will be interacting with the system? There are quite a few attributes that will be collected about each new Kitty arriving.
  3. Does the system need to scale, or will the data stay relatively static? The application will need to have the ability to scale as other facilities implement the system, but initially the data structure will hold no more that 500 records at a time. Once a Kitty is adopted, its record will be archived in the database and removed from the data structure.
  4. Does the data need to be sorted? Yes, the Kitties will need to be sorted so they are easier to search and access.
Ok, let's take this information to make decisions on how best to build the Kitty data in the system. What Kitty data does the AHS collect for incoming Kitties?
  • Name
  • Color
  • Estimated Age
  • Date Received
  • Feral or Friendly
  • Location Found
  • Location in Facility
As we build the Kitty class for our data structure, we need to choose the key element. This key will be the element used to store, sort, and search in the data structure. 

After a discussion with our contact at the AHS, we learned that they manage their animals by the date they arrive at the facility. This means that the "Date Received" above is our key element.

Now we can build our Node attributes:



When selecting the best data structure for our application, we need to consider several key factors:
  • The structure must be sorted to ensure efficient data access.
  • The Node key will always be unique, because ZonedDateTime is set in milliseconds.
  • There will be fewer than 500 Kitty records in the system at any given time (Phase 1) 
  • Search operations will be based on the key element, returning an array of Kitty nodes with the same dtReceived date (the time stamp won't be used when searching)

So which Abstract Data Structure fits these requirements?

👉And the winner is....drum roll please....

TreeSet

Given these requirements, the java TreeSet data structure is an ideal choice because a TreeSet:
  • Automatically sorts and balances elements as they are added
  • Ensures uniqueness of each node using the unique timestamp to add, sort, and search nodes
  • Allows Kitties received on the same day to remain in the same sorted order creating a stable structure
  • Offers an efficient performance with a time complexity of O(log n), making it suitable for scenarios requiring logarithmic scalability. Note: It's log 2 of n because it's a binary tree with two child nodes to each parent node.

So why didn't we choose HashSet? It seemed like a no brainer with its lightning-fast search function?

There are a few reasons, but mainly because it's not sorted. We can't guarantee that Kitties that arrive the same day will be adopted on the same day, so it's more efficient to search on a sorted structure.

Secondly, it can be tricky refining the hash code computation for consistent distribution. This could lead to clustering causing a high number of collisions and an inefficient use of memory.

Related to clustering, the most important reason we don't want to use Hashing for this use case, is that it can be expensive from a memory usage standpoint. Clustering and a growing database can cause allocated memory to grow substantially.

Hashing can be a great option in certain situations, just not for this implementation.

Wrap Up

Looking at the purpose, usage and data elements being used in a software application before designing the system will allow you to choose the most efficient data structures. Writing elegant, streamlined code is a talent that takes practice along with a little finesse.
-
Thanks for reading and if you have question, please leave a comment.

With over 25 years of technical industry experience, Sarah Writtenhouse shares her insights about job market trends, emerging technology and software development.
 
Find her LinkedIn and Medium.

Comments