Wednesday, May 25, 2011

Week 9 Wednesday - Full Run-down of Process

~ Wednesday ~

By the way, final report rough draft here http://acsweb.ucsd.edu/~mezhang/cse190/MabelZhang_cse190_FinalReportRoughDraft.pdf

And... eh... oh! Right. Found some mistakes: one in pure code, one in the way I built the training matrix for the layer 1 attribute classifiers. Yeah, that was dumb: I only gave it positive data and never any negative data, so the predictions were weird. But now it's so much better!

The matrix is sparse, since there are 16 ingredient categories and each dish usually only contains 3 or 4. Regardless, comparing the ground truth with the prediction results, I can see that the entries that should be 0 came out 0, and the entries that shouldn't be 0 got predicted percentages that are reasonably close! (It's regression, so I can't measure exact accuracy by hard comparison; I'd probably need a threshold if I were to automate the accuracy count. Also, because the matrix is sparse, the accuracy rate is inflated: many entries (at least ~80% by eyeballing) are 0s.)
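If I do automate it, a minimal sketch of what a thresholded accuracy count could look like (the function names and the 0.05 threshold are made up for illustration):

    import numpy as np

    def thresholded_accuracy(truth, pred, thresh=0.05):
        # Count a regression prediction as correct if it lands
        # within `thresh` of the ground-truth area fraction.
        return np.mean(np.abs(truth - pred) <= thresh)

    def nonzero_accuracy(truth, pred, thresh=0.05):
        # Same, but only over entries whose truth is non-zero,
        # to avoid the inflation from all the 0 entries.
        mask = truth != 0
        return np.mean(np.abs(truth[mask] - pred[mask]) <= thresh)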

Pictures! Okay I finally have a pictorial description of what I did.

______________________________________________________________________

First, the original input images (228 total):

(* there are some extra images in the directory, so in case you go count the pictures in the screenshot, no, it doesn't add up to 228.)

http://acsweb.ucsd.edu/~mezhang/cse190/input_bibimbap.png
http://acsweb.ucsd.edu/~mezhang/cse190/input_bulgogi.png
http://acsweb.ucsd.edu/~mezhang/cse190/input_pasta.png
http://acsweb.ucsd.edu/~mezhang/cse190/input_pizza.png
http://acsweb.ucsd.edu/~mezhang/cse190/input_sashimi.png
http://acsweb.ucsd.edu/~mezhang/cse190/input_sushi.png

 ______________________________________________________________________
    
Images clustered into ingredient regions by color textons (228 total):

Each "grayness" is 1 cluster. It may be discontinuous in the image, depending on where the ingredient is.

(* there are some extra images in the directory, so in case you go count the pictures in the screenshot, no, it doesn't add up to 228.)

http://acsweb.ucsd.edu/~mezhang/cse190/textonClusters_all_s.png


______________________________________________________________________

Illustration of the classification process, from raw image to prediction, using one typical image as an example:

1. Original image (a typical example).

http://acsweb.ucsd.edu/~mezhang/cse190/bibimbap01_s.jpg


2. Grayscale image clustered by ingredient region, using Texton-izer. Each grayness is interpreted as one ingredient.
Note the list of small numbers in the upper-left corner: the grayness of each number corresponds to the grayness of its cluster. The number is used as the row number in the label file, for convenience when labeling.

http://acsweb.ucsd.edu/~mezhang/cse190/bibimbap01_s_textonMap.jpg


3. Manually label each gray cluster with an ingredient ID from the list of 16 ingredients we defined (below). Each ingredient cluster is assigned one ingredient ID.

0 pasta
1 tomato
2 greens
3 red fish, as in tuna
4 seaweed, as in sushi
5 carrots, orange or pink veggies
6 meat, brown bread
7 orange fish, shrimp, fish eggs, fried food
8 white veggies, rice
9 dark veggies, eel-colored ingredients
10 egg yolk, yellow green veggies, cooked onions, light white fish
11 chili sauce, kimchi, red peppers, red veggies
12 whitefish, pinkish raw fish
13 cheese
14 flour (roasted, yellow), burnt cheese
15 pepperoni, sausage

This image's label file would be the following (row # corresponds to the cluster #, so rows 0 1 2 3 4 5; -1 means the cluster contains no ingredient):
-1
10
8
5
-1
-1

(* the last row is just discarded when read into the program; it's an extra row from when I thought I'd cluster into 6 clusters.)
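A tiny sketch of how step 4 below could read this file back (names are hypothetical; the row trimming is the discard just mentioned):

    def read_label_file(path, n_clusters=5):
        # One ingredient ID per line; -1 marks a cluster with no
        # ingredient. Keep only the first n_clusters rows, dropping
        # the extra 6th row left over from the 6-cluster idea.
        with open(path) as f:
            return [int(line) for line in f][:n_clusters]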


4. Read back the manual label file. Two things to do:

i. Output an area label file to later help build the attribute vector for attribute-based classification (this is NOT the attribute vector YET). Area is measured as the number of pixels in a cluster. The vector is an n-by-1 vector, where n = number of clusters.

This image's area distribution is (row # is the cluster #):
1754
29016
11196
7476
6558

Later, we will use the ratio (an ingredient's area / total food area) to label the attribute vector for layer 1 training of the attribute-based classification (a code sketch of this appears after step 5's example below).

ii. Output EMD data. For each cluster, use all the RGB points in that cluster to calculate its Earth Mover's Distance to every other cluster in this image. The ground distance between two points given to EMD is just the standard Euclidean distance in RGB space: sqrt((R1-R2)^2 + (G1-G2)^2 + (B1-B2)^2).

A constant NONEXISTENT = -1 is used to indicate ingredients that aren't in the image.

Clusters that don't have an ingredient are discarded; an absent ingredient's EMD vector is later automatically filled with a row of [-1, ..., -1] before training.

This image's EMD:

-1.000000 -1.000000 -1.000000 -1.000000 -1.000000 145.025467 -1.000000 -1.000000 138.852463 -1.000000 0.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000
-1.000000 -1.000000 -1.000000 -1.000000 -1.000000 44.827419 -1.000000 -1.000000 0.000000 -1.000000 138.755737 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000
-1.000000 -1.000000 -1.000000 -1.000000 -1.000000 0.000000 -1.000000 -1.000000 54.944424 -1.000000 104.543594 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000

EMD with itself is 0, so by looking at which column is 0, you can tell each row's ingredient ID. Here the 0 columns are ingredients 10, 8, and 5, which matches our manual labels; a nice sanity check.
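For reference, a sketch of this cluster-to-cluster EMD and of assembling one 16-column row (my code has its own EMD implementation; this sketch leans on OpenCV's built-in cv2.EMD instead, and all names are hypothetical):

    import numpy as np
    import cv2

    NONEXISTENT = -1.0

    def cluster_emd(points_a, points_b):
        # EMD between two clusters, each an (N, 3) float array of RGB
        # points, with uniform weights and Euclidean (L2) ground
        # distance. cv2.EMD takes float32 signatures whose first
        # column is the weight.
        sig_a = np.hstack([np.ones((len(points_a), 1)), points_a]).astype(np.float32)
        sig_b = np.hstack([np.ones((len(points_b), 1)), points_b]).astype(np.float32)
        dist, _, _ = cv2.EMD(sig_a, sig_b, cv2.DIST_L2)
        return dist

    def emd_row(this_ing, cluster_points, n_ingredients=16):
        # cluster_points: dict of ingredient ID -> RGB points of its
        # cluster in this image. Column j = EMD to ingredient j,
        # NONEXISTENT for ingredients absent from the image; the
        # column for this_ing itself comes out 0.
        row = np.full(n_ingredients, NONEXISTENT)
        for j, pts in cluster_points.items():
            row[j] = cluster_emd(cluster_points[this_ing], pts)
        return row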


5. Output attribute vector. Using the manual label file and the output area file, we can derive which ingredients exist in the image and their corresponding areas. We discard the clusters labeled -1 in the manual file (recall that -1 means there is no ingredient there), and we add up the areas of all clusters that have an ingredient to get the total food area. Then for each ingredient, (ingredient area / total food area) gives a fraction, which goes into the attribute vector.

The attribute vector is an n-by-1 vector, where n = number of ingredients, in this case 16. Each row is the fraction of the total food area that the ingredient occupies on the plate.

In English, this means the vector tells us which ingredients exist on the plate and how much of each there is relative to the others, i.e. what portion of the dish each ingredient makes up.

For this image, it happens to be:

0.000000
0.000000
0.000000
0.000000
0.000000
0.156769
0.000000
0.000000
0.234776
0.000000
0.608455
0.000000
0.000000
0.000000
0.000000
0.000000

All the non-zero rows should add up to 1, which they do.
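A sketch of steps 4i and 5 together, from the per-pixel cluster map to this vector (hypothetical names; assumes the texton map is a 2-D array of per-pixel cluster indices):

    import numpy as np

    def attribute_vector(texton_map, labels, n_ingredients=16):
        # labels: per-cluster ingredient IDs, -1 = no ingredient.
        labels = np.asarray(labels)
        # Step 4i: a cluster's area = its number of pixels.
        areas = np.bincount(texton_map.ravel(),
                            minlength=len(labels)).astype(float)
        keep = labels >= 0                 # discard -1 clusters
        total = areas[keep].sum()          # total food area
        vec = np.zeros(n_ingredients)
        for ing, area in zip(labels[keep], areas[keep]):
            vec[ing] += area / total       # ingredient may span clusters
        return vec

Plugging in the example's labels (-1, 10, 8, 5, -1) and areas (1754, 29016, 11196, 7476, 6558) gives a total food area of 29016 + 11196 + 7476 = 47688, and 29016/47688, 11196/47688, 7476/47688 reproduce the 0.608455, 0.234776, 0.156769 above.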


6. Train an attribute classifier (layer 1 of the attribute-based classification) for each ingredient (i.e. each "attribute").

For each ingredient, all the images containing that ingredient are gathered to form the positive data, where the area count is non-zero, and all the remaining images (which do not contain that ingredient) form the negative data, where the area count is 0.

For each image, the training sample is the EMD vector of this ingredient in this image (EMD here, but it could be an RGB color histogram or another low-level feature descriptor). The ground truth is this ingredient's area fraction.

So one row of the training matrix looks like:
| EMD data | Ground truth: area fraction |


For the example image, its rows in the training data for ingredients 10, 8, and 5 are filled; its row in every other ingredient's classifier has features of all -1s (NONEXISTENT) and ground truth 0 (area is 0; those ingredients don't appear in the image).

For ingredient 10:

-1.000000 -1.000000 -1.000000 -1.000000 -1.000000 145.025467 -1.000000 -1.000000 138.852463 -1.000000 0.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000
Ground truth: 0.608455


Ingredient 8:
-1.000000 -1.000000 -1.000000 -1.000000 -1.000000 44.827419 -1.000000 -1.000000 0.000000 -1.000000 138.755737 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000
Ground truth: 0.234776


Ingredient 5:
-1.000000 -1.000000 -1.000000 -1.000000 -1.000000 0.000000 -1.000000 -1.000000 54.944424 -1.000000 104.543594 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000
Ground truth: 0.156769

For all other ingredients:
-1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000
Ground truth: 0.000000

Training is done with SVM RBF kernel regression.
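As a hedged sketch of one ingredient's layer 1 training (I'm substituting scikit-learn's SVR here, which is not necessarily what the project uses; the toy data and hyperparameters are invented):

    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)

    # Toy stand-in for one ingredient's training matrix: 16 EMD
    # columns (-1 = NONEXISTENT), plus ground-truth area fractions.
    X = np.full((20, 16), -1.0)
    X[:10, :3] = rng.uniform(0, 150, (10, 3))       # positive images
    y = np.concatenate([rng.uniform(0.1, 0.7, 10),  # their area fractions
                        np.zeros(10)])              # negatives: area 0

    reg = SVR(kernel='rbf', C=1.0, gamma='scale')   # RBF-kernel regression
    reg.fit(X, y)
    print(reg.predict(X[:2]))                       # predicted area fractions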


7. Test on test fold, output prediction results.


8. Train the layer 2 classifier, the final cuisine classifier. (* in progress)

Training data is the attribute vector.
Ground truth is the cuisine ID, 0 for Italian, 1 for Japanese, 2 for Korean.


9. Test on test fold. (* in progress)

Testing data is a vector extracted from the prediction result outputted in step 7 above.
Ground truth is the cuisine ID.
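Since steps 8 and 9 are still in progress, this is only a guess at their eventual shape: a sketch using an SVM classifier over attribute vectors (the choice of SVC and all the data below are placeholders, not the project's actual setup):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)

    # Layer 2: attribute vector (16 area fractions) -> cuisine ID.
    A_train = rng.random((30, 16))       # ground-truth attribute vectors
    cuisine = np.repeat([0, 1, 2], 10)   # 0 Italian, 1 Japanese, 2 Korean
    A_test = rng.random((5, 16))         # layer 1 predictions from step 7

    clf = SVC(kernel='rbf').fit(A_train, cuisine)
    print(clf.predict(A_test))           # predicted cuisine IDs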

Saturday, May 21, 2011

Week 8 Saturday

~ Saturday ~

Footnote: Just realized that OpenCV actually has an EMD implementation... O_O Wow. Wow. Wow okay. Well okay, whatever.

Found a stupid bug that resulted in all the points used for EMD being the same point. Results from last time are certainly flawed.

Turned on randomly choosing the points to pass to EMD, instead of taking the first 100. The effect is that the distance between two clusters is now somewhat randomized, so it's a bit different every time, and the reversed distance is no longer exactly the same, e.g. emd(cluster1, cluster2) is not equal to emd(cluster2, cluster1), whereas before they were equal.
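The sampling itself is nearly one line with numpy; a sketch (hypothetical names):

    import numpy as np

    def sample_points(cluster_points, n=100, rng=None):
        # Pick up to n random RGB points from a cluster instead of the
        # first n. A fresh sample is drawn on each call, which is why
        # emd(cluster1, cluster2) != emd(cluster2, cluster1) exactly.
        rng = rng if rng is not None else np.random.default_rng()
        if len(cluster_points) <= n:
            return cluster_points
        idx = rng.choice(len(cluster_points), size=n, replace=False)
        return cluster_points[idx]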

Friday, May 20, 2011

Week 8 Friday

~ Friday ~

Finally got the attribute-based classifier code fully running, but the results don't look quite right... I wonder if there are too few data samples for each attribute. The regression predictions look almost identical for every fold, and the accuracy is 0%. The predicted values are in a plausible range, but there must be something wrong with either the way I'm constructing the matrix or the training.

Ran RGB color histogram and EMD on the complete dataset again, since it has quadrupled from before (57 to 228 images). The color histogram did surprisingly better than the last round:

fold#   accuracy%
0       46.666668
1       55.555557
2       57.777779
3       51.111111
4       56.250000
mean    53.472223
var     15.564670993
std dev 3.945208612

EMD, on the other hand, dropped a lot:
fold #    accuracy%
0        9.615385
1        17.391304
2
3
4

Wednesday, May 18, 2011

Week 8 Wednesday

~ Wednesday ~

All 228 images labeled with attribute vectors, 4 clusters per image.

All EMD, RGB hist, SIFT data outputted.

Just need to finish up the code that stitches together the matrices for the 2 stages of training. All the data is in place, whew. Hopefully that happens tonight or tomorrow morning.

Monday, May 16, 2011

Week 8 Monday

~ Monday ~

Still labeling data for attribute vectors... Just labeled for 2 hours and realized we've been doing it the wrong way. Haha. So I wrote a script to do it... I originally tried to avoid writing the script, but coming back to it, it's actually not so bad. It might even be faster.

The new labeling expands the dataset from 57 images to 228 images, with 76 images per category: Italian (pasta + pizza), Japanese (sushi + sashimi), and Korean (bibimbap + bulgogi). Images are hand-cropped to the food area so that we get as little background as possible.

Use 3 low-level features: EMD, RGB color histogram, SIFT.
EMD, because it worked well above chance before; let's see how well it works in attribute-based classification.
RGB color histogram, just because.
SIFT, just to try it, even though color information is very important for food and SIFT discards color by taking grayscale images as input.
It doesn't really make sense to try any other features that discard color information before seeing how well SIFT works...

RGB hist and SIFT data are already calculated and written to file for loading back into training; they are just waiting on the labeling so they can be trained on. EMD needs the cluster labels before it can output its data.

With the auto-script, labeling the attribute vectors should be quicker. Still need to label the clusters with the corresponding ingredients for the new images, though, to feed to the script.

Saturday, May 14, 2011

Week 7 Saturday

~ Saturday ~

So... each image has a different number of keypoints when returned from SIFT, but the matrix passed to SVM still has to have rows of the same length. How do we deal with that?

Using SIFT with SVM, paper from Tuebingen University [8]:
http://citeseer.ist.psu.edu/viewdoc/download;jsessionid=61B12FAF72E0849976C4EB6095BB6A04?doi=10.1.1.88.4011&rep=rep1&type=pdf

So the guys above used something called the Bhattacharyya kernel, "trivially related to the better known Hellinger's distance," because these other guys used it: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.67.8610&rep=rep1&type=pdf . They also mentioned chi-squared, but I don't know why they didn't use it.
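For concreteness, here's the "trivially related" part in the discrete (histogram) case; the kernel in [8] is the continuous-density generalization, so this sketch is just the histogram form:

    import numpy as np

    def bhattacharyya(p, q):
        # Bhattacharyya coefficient of two normalized histograms.
        return np.sum(np.sqrt(p * q))

    def hellinger(p, q):
        # Hellinger distance; "trivially related": H = sqrt(1 - BC).
        return np.sqrt(1.0 - bhattacharyya(p, q))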

The guys who introduced attribute-based classification used chi-squared.

What's the advantage of each?

Chi-square:
http://www.okstate.edu/ag/agedcm4h/academic/aged5980a/5980/newpage28.htm
For unordered sets, nonparametric techniques are good:
"There are, however, certain advantages of nonparametric techniques such as Chi Square (X2). For one thing, nonparametric tests are usually much easier to compute. Another unique value of nonparametric procedures is that they can be used to treat data which have been measured on nominal (classificatory) scales. Such data cannot, on any logical basis, be ordered numerically, hence there is no possibility of using parametric statistical tests which require numerical data."

For... data of variable length, maybe? That is what I need to know:
"The Chi Square (X2) test is undoubtedly the most important and most used member of the nonparametric family of statistical tests. Chi Square is employed to test the difference between an actual sample and another hypothetical or previously established distribution such as that which may be expected due to chance or probability. Chi Square can also be used to test differences between two or more actual samples."

More about chi-square:
http://math.hws.edu/javamath/ryan/ChiSquare.html


Attribute-based classification:

Layer 1 goes from the image's low-level features to the attributes, in this case ingredients. Ground truth is labeled as the area of a certain attribute over the entire area of the food (excluding the background), so a fraction.

There are 13 ingredients, so train 13 attribute classifiers. Since images share ingredients, one image may be used in training multiple classifiers. For example, if a plate of pasta contains pasta, tomatoes, and meat, it will be used to train all 3 of the pasta, tomato, and meat attribute classifiers.

These classifiers will be regressions, it seems, as opposed to discrete categories, because the answer to whether an image contains an ingredient is not just yes or no (binary) but how much of that attribute it contains. That is reasonable for plates of food: some peppers may be a garnish instead of a full plate, which is the difference between a stuffed-pepper dish and roasted meat with a pepper garnish.

Eh... but then, no, they should be categories if chi-square is used, it seems...

Layer 2 goes from the attributes (the ingredient labels) to cuisines.

Classifier will be SVM with a chi-square kernel.
Features: SIFT seems good as a standard, I guess, but it discards color information, which is really important for food. Maybe also try RGB/HSV color histograms? Or it might be a good idea to build on Tingfan Wu's textons idea and try textons as a feature descriptor for attribute-based classification.
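A sketch of what an SVM with a chi-square kernel could look like (a scikit-learn stand-in, not this project's code; sklearn's chi2_kernel is the exponentiated chi-squared kernel and expects non-negative features such as histograms):

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.metrics.pairwise import chi2_kernel

    rng = np.random.default_rng(0)
    X = rng.random((12, 20))            # toy histograms (non-negative)
    y = np.repeat([0, 1, 2], 4)         # 3 cuisine classes

    # k(x, y) = exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i))
    K = chi2_kernel(X, X, gamma=1.0)
    clf = SVC(kernel='precomputed').fit(K, y)

    X_new = rng.random((2, 20))
    print(clf.predict(chi2_kernel(X_new, X, gamma=1.0)))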

Reference:
[8] J. Eichhorn and O. Chapelle. Object Categorization with SVM: Kernels for Local Features. http://citeseer.ist.psu.edu/viewdoc/download;jsessionid=61B12FAF72E0849976C4EB6095BB6A04?doi=10.1.1.88.4011&rep=rep1&type=pdf

Thursday, May 12, 2011

Week 7 Thursday

~ Thursday ~

Attribute-based classification: a two-layer classification that relates the image information to a set of high-level attributes that humans use to describe objects, then saves this attribute information for training and discards the low-level image features. The attributes form a layer between the final classification result and the low-level image features.

Once a classifier is trained, it can also learn new objects that were not in the training set, as long as they can be described by the attributes. The paper uses this to show that more realistic testing is possible with this method: it can identify objects just by "imagining" the high-level attributes, like humans naturally do, without having to have seen an example of that category beforehand.

Paper
http://www.kyb.mpg.de/publications/attachments/CVPR2009-Lampert_%5B0%5D.pdf [7]

Dataset
http://attributes.kyb.tuebingen.mpg.de/

Reference:
[7] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning To Detect Unseen Object Classes by Between-Class Attribute Transfer. In CVPR, 2009. http://www.kyb.mpg.de/publications/attachments/CVPR2009-Lampert_%5B0%5D.pdf

Sunday, May 8, 2011

Week 6 Sunday

~ Sunday ~

Cross validation implemented.
Can specify the number of folds, and which fold to use as the test fold. The program takes a list of files (previously evenly interlaced with filenames from the different categories) and separates it into num_folds files; each file is one fold.

When training, the program skips the test fold and reads all the images in the other folds as training data. When testing, it reads only the images in the test fold.
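A sketch of that splitting (hypothetical names; the remainder-to-last-fold chunking is what gives the 11, 11, 11, 11, 13 fold sizes listed below):

    def split_folds(filenames, num_folds):
        # The list is already evenly interlaced by category, so
        # consecutive chunks stay roughly balanced by cuisine; the
        # remainder goes to the last fold (57 images over 5 folds
        # gives 11, 11, 11, 11, 13).
        size = len(filenames) // num_folds
        folds = [filenames[k * size:(k + 1) * size]
                 for k in range(num_folds - 1)]
        folds.append(filenames[(num_folds - 1) * size:])
        return folds

    def train_test(folds, test_fold):
        # Training skips the test fold; testing reads only it.
        train = [f for k, fold in enumerate(folds)
                 if k != test_fold for f in fold]
        return train, folds[test_fold]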


All training / testing is run on low-res images (250 pixels in width).


EMD training on ingredients (not cuisines, just ingredients); each sample row is one ingredient:
ingredient row: | 0 1 2 3 4 5 6 7 8 9 10 11 | truth

Each column is the EMD distance to other ingredients found in THIS image.
Problem: might be too sparse for effective learning, because each image only contains 1-3 ingredients, 4 at the very most.

12 categories (ingredients); images from 3 cuisines, 19 images per cuisine category, 57 images total.
Images are divided into 5 folds: 11 images in each of the first 4 folds, 13 images in the 5th.
Each fold contains as even a number of images per cuisine as possible (e.g. 4 images for pasta, 4 for sushi, 3 for bibimbap in an 11-image fold), but not per ingredient.

Chance: 1 / 12 = 8.33%

Fold#    %
0       29.6%
1       33.3%
2       27.6%
3       26.1%
4       28.6%
avg     29.04%
var     5.8744
std dev 2.4237


Eh... TODO
The pixels passed to EMD are currently the first 100 points in a cluster; should uncomment the 100-random-points code and see if that's better.



Color Histogram

RGB color space.
20 and 50 bins produce the same accuracy on each fold.

Chance: 1 / 3 = 33.3%

Fold#    %

0       27.3% (all predicted as 3)
1       27.3% (all 2)
2       27.3% (all 1)
3       27.3% (all 3)
4       30.8% (some 1s some 2s)
avg     28%
var     (.49 + .49 + .49 + .49 + 7.84 ) / 5 = 1.96
std dev 1.4