Since my research is about visual data understanding, I have thoroughly studied many visual features and how they can be described. It seems that many people who use such features do not have a proper understanding of the feature extraction and description algorithms. These algorithms involve a lot of mathematical procedures that are difficult to follow with an average knowledge of mathematics, which causes users to limit the use of such features in their studies and applications.
I have written a few articles about the use of visual feature descriptors and other image processing techniques in applications, with the source code hosted on CodeProject:
- Bag-of-Features Descriptor on SIFT Features with OpenCV (BoF-SIFT)
- Bag-of-Features Descriptor on SURF and ORB Features (BoF-SURF and BoF-ORB)
- Image Segmentation using Unsupervised Watershed Algorithm with an Over-segmentation Reduction Technique
While answering the questions in the comments, I felt that each of these articles contains subtopics that need to be discussed separately. In this article I gather the important points from the discussion thread "Image similarity in very large database" attached to the original article "Bag-of-Features Descriptor on SIFT Features with OpenCV (BoF-SIFT)", and present them in question-and-answer format.
Q. First, how big can the database be? Can I create a BoF of, for example, 400,000 images, or would the program run out of memory?
A. To create the BoF codebook it is unnecessary to use that many images. But if you did use around 400,000 images, the first thing likely to fail is the data structure inside the Mat object overflowing (I'm not sure, but that depends on the implementation and your system). If that does not fail, and there is no memory leak in the implementations of the OpenCV functions, then there is a good chance the BoF will be created without any problem. But you will definitely have to wait with a cup of tea.
One correct way to handle such a large database is divide and conquer: split the database into small parts and obtain a BoF for each part. Furthermore, you can use a concept hierarchy instead of a flat set of concepts on a single level, so that at each step you only have to classify an image into a small number of classes.
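To illustrate why a concept hierarchy helps, here is a toy sketch (all concept names are hypothetical): a flat classifier must separate every leaf concept at once, while the hierarchy only ever asks for a choice among a node's few children.

```python
# Hypothetical concept hierarchy; leaves are empty dicts.
hierarchy = {
    "scene": {
        "nature": {"waterfall": {}, "forest": {}, "beach": {}},
        "urban":  {"street": {}, "building": {}},
    }
}

def count_leaves(node):
    # Total number of leaf concepts under a node.
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())

# A flat classifier would face all leaf concepts at once...
print(count_leaves(hierarchy["scene"]))                   # 5
# ...while the hierarchy asks for one choice among at most a few children.
print(max(len(c) for c in hierarchy["scene"].values()))   # 3
```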
Q. How can I use the descriptor taken from the input image to measure image similarity? (I'm not sure how I can identify the matched image from the BoF.)
A. You can compare the BoF descriptors of two images to check their similarity, using a standard measure such as the Euclidean distance or the Bhattacharyya distance. If you need to retrieve a set of similar images from a database, you can use a nearest-neighbor algorithm with one of these distance measures. In some applications people only need to understand what is in the image; this kind of problem is called visual semantic retrieval. In other words, we extract the meaning or concept inside the image. In that case you can train an SVM on several concepts with BoF descriptors of images, so the SVM can classify the BoF descriptor of any given image into one of the trained concepts.
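A minimal sketch of these two distance measures in plain Python, using toy 4-bin histograms (a real BoF histogram has one bin per codebook word):

```python
import math

def euclidean(h1, h2):
    # Straight-line distance between two BoF histograms.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))

def bhattacharyya(h1, h2):
    # Bhattacharyya distance between two histograms, L1-normalized
    # here so it is 0 for identical distributions and 1 for disjoint ones.
    s1, s2 = sum(h1), sum(h2)
    bc = sum(math.sqrt((a / s1) * (b / s2)) for a, b in zip(h1, h2))
    return math.sqrt(max(0.0, 1.0 - bc))  # clamp against rounding error

query     = [0.1, 0.4, 0.3, 0.2]
candidate = [0.1, 0.4, 0.3, 0.2]
print(euclidean(query, candidate))      # 0.0 for identical histograms
print(bhattacharyya(query, candidate))  # 0.0 for identical histograms
```

A smaller distance means a more similar pair of images; a nearest-neighbor search is just taking the minimum of such distances over the database.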
Q. So, using BoF to query large databases in real time is not a good approach if I want good performance (1-2 seconds for a 400,000-image database)? What do you recommend in this case?
A. Let's say we have 2 seconds to do the job; then each image gets 0.005 milliseconds (5 microseconds) with a linear search, which is not possible on a conventional hardware setup. But what if we create the BoF descriptors offline and do only the matching online? This is one of the approaches people use when they need to retrieve images without considering the semantics. Here are the steps:
- Create the BoF codebook using a sample of images from the database.
- Create the BoF descriptor of each image and save it in a database with a reference to the image.
- When the query image is given, extract its BoF descriptor and match it against the descriptors in the database. Matching is simply comparing the descriptors, and a descriptor is simply a vector, so the computation can be done in parallel on modern computers (search for SIMD). If you really need to reduce the time further, you can use a GPU (perhaps with CUDA).
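The offline/online split above can be sketched as follows (plain Python with made-up image names and toy 3-bin histograms; a real system would store the descriptors in an actual database and parallelize the scan):

```python
import math

def euclidean(h1, h2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))

# Offline: BoF descriptor of every database image, keyed by image id.
# (Toy histograms here; real ones come from the codebook step.)
index = {
    "beach.jpg":     [0.7, 0.2, 0.1],
    "forest.jpg":    [0.1, 0.8, 0.1],
    "waterfall.jpg": [0.2, 0.3, 0.5],
}

def nearest(query_hist, index, k=2):
    # Online: linear scan, return the k closest images.
    ranked = sorted(index.items(), key=lambda kv: euclidean(query_hist, kv[1]))
    return [name for name, _ in ranked[:k]]

print(nearest([0.15, 0.75, 0.1], index))  # forest.jpg ranked first
```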
As an alternative solution, do not match image to image; match what is inside the image. Say one image contains a waterfall: you can index that image as "waterfall". Once the query image is given, you extract what is inside it and then simply do a string search (or a simple database search with indexing), which should return the result within a few milliseconds on a high-end computer.
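A minimal sketch of this concept-indexing idea, assuming the labels have already been produced offline by some classifier (for example an SVM trained on BoF descriptors; all names here are hypothetical):

```python
from collections import defaultdict

# Hypothetical concept labels assigned offline to each database image.
labels = {
    "img001.jpg": "waterfall",
    "img002.jpg": "beach",
    "img003.jpg": "waterfall",
}

# Build an inverted index: concept -> list of images.
index = defaultdict(list)
for image, concept in labels.items():
    index[concept].append(image)

# Online: classify the query image into a concept, then look it up.
print(index["waterfall"])  # ['img001.jpg', 'img003.jpg']
```

The lookup is a dictionary (or database key) access, so its cost no longer grows with descriptor size, only with the number of matching images.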
Q. How many images can a sample contain? I guess it depends on how accurate I want the matching to be, but I'm not sure. And just one more doubt: using the BoF codebook built from the sample to create a BoF descriptor of an image that is not in the sample, would the matching with another image still be good? Or not as accurate?
A. You are correct, but accuracy and the number of images in the sample are not closely related. Sometimes it is better to use fewer images, while in most cases a somewhat larger sample works better. Since a SIFT descriptor contains 128 elements, there are infinitely many possible SIFT descriptors in that vector space; the more descriptors you provide, the more fairly the cluster centers are adjusted to all of them. But since the codebook is extremely small compared to this infinite descriptor space, a large but biased sample may mislead the codebook.
I guess you are asking what will happen if we use an image that was not in the sample set used to create the BoF codebook. If that is your question: even for a completely new image, the codebook can produce a fair BoF histogram. Therefore images with a somewhat similar distribution of SIFT feature vectors will be matched in most cases. But, as always, occasionally (rarely) the matched image may be completely different from the query image, because a SIFT feature vector is not exactly what we humans see in an image: a SIFT feature is simply a set of histograms of gradient magnitudes over 8 gradient-angle bins. We humans see colors, shapes and textures, and when we match images we do it semantically, not by low-level features.
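The quantization step that turns a new image's local descriptors into a BoF histogram can be sketched like this (plain Python, toy 2-D descriptors and a 3-word codebook; real SIFT descriptors are 128-D and the centers come from clustering the sample set):

```python
import math

def nearest_center(desc, centers):
    # Index of the closest codebook word (cluster center).
    return min(range(len(centers)),
               key=lambda i: math.dist(desc, centers[i]))

def bof_histogram(descriptors, centers):
    # Count how many local descriptors fall into each codebook word,
    # then L1-normalize so the number of keypoints does not matter.
    hist = [0.0] * len(centers)
    for d in descriptors:
        hist[nearest_center(d, centers)] += 1
    total = sum(hist) or 1.0
    return [h / total for h in hist]

# Toy codebook and the descriptors of a "new" image not in the sample.
centers   = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
new_image = [[0.1, 0.1], [0.9, 1.1], [1.1, 0.9], [2.1, -0.1]]
print(bof_histogram(new_image, centers))  # [0.25, 0.5, 0.25]
```

Note that nothing in this step requires the image to have been in the sample: any descriptor is simply assigned to its nearest existing word, which is why unseen images still get a usable histogram.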
So you can expect good accuracy with this, but it is certainly nowhere near perfect.
Please read my original article on CodeProject; your comments are welcome.