Vector range search (or vector search engine) is the task where, given 2 sets of vectors with the same number of dimensions, a database set and a query set, find for each query vector a nearby vector in the database set, such that the mean distance between the query vectors and their corresponding vector in the database is as small as possible.
vector_database = [
[0.05, 0.16],
[0.31, 0.74],
[0.32, 0.8 ],
[0.03, 0.25],
[0.33, 0.07],
[0.88, 0.77],
[0.91, 0.29],
[0.7 , 0.02],
[0.53, 0.04],
[0.72, 0.38]
]
query_vectors = [
[0.89, 0.86],
[0.26, 0.88],
[0.17, 0.41]
]The Euclidean distance from each query vector to the database set is approximately:
distances = [
[1.09, 0.59, 0.57, 1.05, 0.97, 0.09, 0.57, 0.86, 0.9 , 0.51],
[0.75, 0.15, 0.1 , 0.67, 0.81, 0.63, 0.88, 0.97, 0.88, 0.68],
[0.28, 0.36, 0.42, 0.21, 0.38, 0.8 , 0.75, 0.66, 0.52, 0.55]
]It can be seen that, for each query vector, if we select the following vectors in the database, the mean Euclidean distance will be below 0.2:
indexes = [
5, // select vector 5 in database as "nearby" to query vector 0
1, // select vector 1 in database as "nearby" to query vector 1
0, // select vector 0 in database as "nearby" to query vector 2
]
total_distance = 0.09 + 0.1 + 0.28 = 0.47
mean_distance = 0.47 / 3 = 0.16In TIG, the vector search challenge features vectors with 250 dimensions, and uses the Euclidean distance. The set we sample from is the hypercube . The number of Database vectors scales with the number of Query vectors, such that the number Database vectors is the Query vectors multiplied by . The challenge has a configurable difficulty through the parameter which controls the number of query vectors in an instance.
Real-world data is typically clustered, we generate cluster sizes from the log-normal distribution, such that the mean number of points in a cluster is . All vectors in the Query and Database sets are generated in the following way. When a vector is generated it is assigned a cluster center with a probability proportional to that clusters size. Once a vector is assigned a cluster center it is generated from a anisotropic Gaussian with mean equal to the cluster center.
Solutions are given a quality score `mean_distance` which is the mean of the distances of the queries to their assigned neighbours.
Vector search has a wide range of applications an example of which is Threshold-Based Anomaly Detection, where the vector database represents operational data in a high-dimensional space, and query vectors represent new incoming data points to be monitored for anomalies. If the average distance exceeds a predefined threshold, the query vectors are flagged as anomalies.
See also for example Outlier detection for high dimensional data: https://dl.acm.org/doi/abs/10.1145/375663.375668
Another example application of vector search is in network security, where the vector database corresponds to historical traffic patterns, and query vectors are new traffic data. By tracking the mean distance between sets new data points and historic "regular" data, any deviation exceeding a threshold can indicate a potential intrusion.