Search logic behind "Username already exists"

Every time you try to create a new account on any of the websites, you begin with your name and, more often than not, you get the response “Username is already taken”. Then, you add “your name + date of birth”, to realize it also has been “already taken” to finally end up with “your name + date of birth + license plate + graduation” to create the account. I’m sure a lot of you are nodding and saying “been there, done that.”
But how many of you have wondered how these giant sites like Facebook, Instagram, and Gmail verify whether the username is taken or not?

Let's start with the two possible approaches

1) Linear Search

Let’s assume that Facebook stores all the data in its directory. And the software simply checks each name in the list one at a time and if it doesn’t find a match, it tells you if your desired username is available or not.
Doesn’t sound sensible, does it? The software has to look at each name every time a username needs to be verified.
The technique is unreasonable when you compare it with the Facebook database, which has over 1.5 billion users and Twitter which have 300 million users.
2) Binary Search
This makes more sense, with all the brains working at Facebook.
Facebook keeps all the data sorted and arranged in an alphabetized list. The list is 1.5 billion characters long, stored like a, aa,aaa…...xyy, XYZ, yaa,yaa,yxz, zaa, zac very similar to your dictionary.
When you enter a name, it matches it with the username exactly in the middle of the list. If it matches, the software rejects the new username. If it doesn’t match, which has a lot of possibilities, the next question the software check is “ If searched alphabetically, does the requested username comes before or after this username in middle?’
If it comes before, then the software knows that all the 750 million people after the username found in the middle of the list is of no use for the current search.That eliminates 750 million possibilities in the single comparison. If the desired username comes after the name in middle (alphabetically), it eliminates all the names before it.  Either way, the software eliminates almost 750 million names for search in the first comparison.
Next, it takes the selected half of the list and immediately matches the requested username with the name in the middle of the remaining list. If it matches, the requested name is rejected and if it doesn’t, the requested name is again checked for the possibility of it occurring before or after the name in middle.
If it is before, reject the 350 million names after the name. And proceed with divide and conquer for the rest of the names as done earlier. If the requested name is after the middle string, reject the names before it and proceed with 350 million remaining names.
By dividing the list every time we can compare the required username with the names in the list quite quickly...
But the question is how quickly?
You will continue dividing the list into two until you can no longer do so.And when you are left with one name in the database, you match it with your desired username.This would be the last step before we find whether the username ‘chosen’ is available or not.
For a data as big as 1.5 billion, this method would require no more than 30 steps.
2 to the power of 30 gives you 1.73 billion, which is closest to our expectation of 1.5 billion users on Facebook. Which is a fewer number of steps and complication for the same data when searched with a linear search. 

What if the developers are very smart and use a Bloom filter as the solution?

Before we understand Bloom filter, we need to understand the concept of Hashing.
Hashing is like the license plate of your car. A hash function takes data of any length as an input and gives you a smaller identifier of a smaller, fixed length, which is often used to identify or compare data.
Bloom filter works simply - Test and Add.
Test whether the element is present in the list,
  • If it returns false, the element is definitely not on the list
  • If it returns true, the element might be probably on the list. This false positive (Will discuss it below) is a function of the Bloom filter and depends on the size and is independent of the hash function used.
A bloom filter divides memory area into buckets, which are filled with various hashes generated from one or many hash functions.
Let’s understand with an example. Suppose you have a memory bucket of size 10 and 3 hash functions which will give you three unique identifiers. Suppose you enter Ronaldo into this memory bucket. Ronaldo, when passed through these hash functions, gives the value of 1,4,5. The filter quickly fills the memory in the bucket, with these identifiers.
Now we enter Messi into the memory bucket. Messi when passed through the hash function give its own unique identifier. In this case, it is 3,7,8 for which the filter fills the bucket too.
Since the functions always return the same value for similar input, we can be sure that when a Ronaldo’s name is given to the filter, it would check in location 1,4, and 5 to find it full, which means that the name Ronaldo is already on the list.
Let’s continue with another example of entering Rooney into the memory.
Rooney, when passed through the hash function, returns 2,6, and 8. The filters check the memory to find that though 8 is full, but 2 and 6 are empty, which means you don’t have Rooney in the memory. Hence, the name is available.
But when the name Neymar is passed through the hash functions it returns the value of 1,3,7 which eventually makes the filter believe that the name Neymar is already present on the list.
This scenario explains the concept of false positives used in Bloom filters.
One can control a false positive by controlling the size of the Bloom filter. More space is inversely proportional to false positives.
Each of the above-mentioned technique comes with its own advantages and disadvantages. With technology and computers getting smarter and faster every day, even the brute force method seems feasible. But with space and time complexity, most of the companies such as Reddit prefers Binary search. While few others such as Medium use Bloom filters smartly to suggest articles for you without repeating them again on your timeline.
Credits: HackerEarth

No comments:

Post a Comment