~ Wednesday ~
By the way, the final report rough draft is here:
http://acsweb.ucsd.edu/~mezhang/cse190/MabelZhang_cse190_FinalReportRoughDraft.pdf
And... eh... oh! Right.
Found some mistakes: one in pure code, one in the way I built the training matrix for the layer 1 attribute classifiers.
Yeah, that was dumb: I only gave it positive data and never gave it any negative data, so the predictions were weird.
But now, it's so much better!
The matrix is sparse, since there are 16 ingredient categories and each dish usually contains only 3 or 4. Regardless, comparing the ground truth with the prediction results, I can see that the entries that should be 0 came out as 0, and the entries that shouldn't be 0 got predicted percentages that are reasonably close! (It's regression, so I can't tell the exact accuracy by hard comparison; I'd probably need a threshold if I were to automate the accuracy count. And because the matrix is sparse, the accuracy rate is inflated: many entries, at least ~80% by eyeballing, are 0s.)
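To make that concrete, here's a minimal sketch of how the thresholded accuracy count could be automated. The file names, the 0.05 tolerance, and the matrix shapes are my assumptions, not the project code:

    import numpy as np

    # Assumed layout: both files hold a (num_images x 16) matrix of area
    # fractions; rows = images, columns = the 16 ingredient categories.
    ground_truth = np.loadtxt("groundtruth.txt")   # hypothetical file name
    predicted = np.loadtxt("predictions.txt")      # hypothetical file name

    # Regression outputs never match exactly, so count a prediction as
    # correct when it lands within a tolerance of the ground truth.
    TOLERANCE = 0.05   # made-up threshold
    correct = np.abs(predicted - ground_truth) < TOLERANCE

    print("raw accuracy (inflated by all the 0 entries):", correct.mean())

    # Restricting to entries that are non-zero in the ground truth removes
    # the inflation from the ~80% of entries that are 0.
    nonzero = ground_truth > 0
    print("accuracy on non-zero entries only:", correct[nonzero].mean())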
Pictures! Okay, I finally have a pictorial description of what I did.
______________________________________________________________________
First, the original input images (228 total):
(* There are some extra images in the directory, so in case you go count the pictures in the screenshot: no, it doesn't add up to 228.)
http://acsweb.ucsd.edu/~mezhang/cse190/input_bibimbap.png
http://acsweb.ucsd.edu/~mezhang/cse190/input_bulgogi.png
http://acsweb.ucsd.edu/~mezhang/cse190/input_pasta.png
http://acsweb.ucsd.edu/~mezhang/cse190/input_pizza.png
http://acsweb.ucsd.edu/~mezhang/cse190/input_sashimi.png
http://acsweb.ucsd.edu/~mezhang/cse190/input_sushi.png
______________________________________________________________________
Images clustered into ingredient regions by color textons (228 total):
Each "grayness" is one cluster. A cluster may be discontinuous in the image, depending on where the ingredient is.
(* There are some extra images in the directory, so in case you go count the pictures in the screenshot: no, it doesn't add up to 228.)
http://acsweb.ucsd.edu/~mezhang/cse190/textonClusters_all_s.png
______________________________________________________________________
Illustration of the classification process, from raw image to prediction, using one typical image as an example:
1. The original image.
http://acsweb.ucsd.edu/~mezhang/cse190/bibimbap01_s.jpg
2. Clustered grayscale image by ingredient region, using Texton-izer. Each grayness is interpreted as one ingredient.
Note that there's a list of small numbers in the upper-left corner. The grayness of each number corresponds to the grayness of its cluster; the number is used as the row number in the label file, for labeling convenience.
http://acsweb.ucsd.edu/~mezhang/cse190/bibimbap01_s_textonMap.jpg
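For the curious, here's a rough Python stand-in for this clustering step. This is NOT the actual Texton-izer (which clusters on color textons, i.e. it also captures texture); plain k-means on pixel colors is just the simplest approximation, and the output file name is made up:

    import numpy as np
    from PIL import Image
    from sklearn.cluster import KMeans

    # Approximate the ingredient-region clustering with k-means on raw RGB.
    # (The real pipeline clusters on color textons, not just color.)
    img = np.asarray(Image.open("bibimbap01_s.jpg"), dtype=np.float64)
    h, w, _ = img.shape

    K = 5  # the example image has 5 clusters
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(img.reshape(-1, 3))

    # Render each cluster as one gray level, like the texton map above.
    gray_map = (labels.reshape(h, w) * (255 // (K - 1))).astype(np.uint8)
    Image.fromarray(gray_map).save("textonMap_approx.png")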
3. Manually label each gray cluster with an ingredient ID from the list of 16 ingredients we defined (below). Each cluster is assigned one ingredient ID.
0 pasta
1 tomato
2 greens
3 red fish, as in tuna
4 seaweed, as in sushi
5 carrots, orange or pink veggies
6 meat, brown bread
7 orange fish, shrimp, fish eggs, fried food
8 white veggies, rice
9 dark veggies, eel-colored ingredients
10 egg yolk, yellow-green veggies, cooked onions, light white fish
11 chili sauce, kimchi, red peppers, red veggies
12 whitefish, pinkish raw fish
13 cheese
14 flour (roasted, yellow), burnt cheese
15 pepperoni, sausage
This image's label file would be the following (row # corresponds to the cluster #, so 0 1 2 3 4 5):
-1
10
8
5
-1
-1
(* The last row is just discarded when read into the program; it's an extra row from when I thought I'd cluster into 6 clusters.)
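Reading that file back in is simple; a minimal sketch (the file name is hypothetical; the layout is as shown above, one ingredient ID per line, -1 for "no ingredient", plus the extra trailing row):

    def read_label_file(path, num_clusters):
        # One ingredient ID per line; row index = cluster number.
        with open(path) as f:
            ids = [int(line) for line in f if line.strip()]
        return ids[:num_clusters]   # discard the leftover extra row

    # For the example image this yields [-1, 10, 8, 5, -1].
    labels = read_label_file("bibimbap01_labels.txt", num_clusters=5)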
4. Read back the manual label file. Two things to do:
i. Output an area label file, to later help build the attribute vector for the attribute-based classification (this is NOT the attribute vector YET). Area is measured by the number of pixels in a cluster. The vector is an n-by-1 vector, where n = number of clusters.
This image's area distribution is (row # is the cluster #):
1754
29016
11196
7476
6558
Later, we will use the ratio of (an ingredient's area / total food area) to label the attribute vector for layer 1 training of the attribute-based classification.
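The area computation itself is just a pixel count per cluster; a sketch, assuming the clustered image is available as a 2-D array of cluster numbers (the variable names are mine):

    import numpy as np

    def cluster_areas(cluster_map, num_clusters):
        # cluster_map: 2-D array with one cluster number per pixel;
        # a cluster's area is simply its pixel count.
        return np.bincount(cluster_map.ravel(), minlength=num_clusters)

    # For the example image this should reproduce the vector above:
    # [1754, 29016, 11196, 7476, 6558]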
ii. Output EMD data. For each cluster, use all the RGB points in that cluster to calculate its Earth Mover's Distance to every other cluster in this image. The ground distance between two points given to EMD is just the standard Euclidean formula: square root of (R1-R2)^2 + (G1-G2)^2 + (B1-B2)^2.
A constant NONEXISTENT = -1 is used to indicate ingredients that aren't in the image.
Clusters that don't have an ingredient are discarded; their EMD vector is later automatically filled with a row of [-1, ..., -1] before training.
This image's EMD:
-1.000000 -1.000000 -1.000000 -1.000000 -1.000000 145.025467 -1.000000 -1.000000 138.852463 -1.000000 0.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000
-1.000000 -1.000000 -1.000000 -1.000000 -1.000000 44.827419 -1.000000 -1.000000 0.000000 -1.000000 138.755737 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000
-1.000000 -1.000000 -1.000000 -1.000000 -1.000000 0.000000 -1.000000 -1.000000 54.944424 -1.000000 104.543594 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000
EMD with self is 0, so by looking at which column is 0, you can tell which ingredient ID it is. Here it's ingredients 10, 8, 5, which matches our manual labels, so that's a sanity check.
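Here's roughly what the pairwise EMD computation could look like, assuming OpenCV's cv2.EMD (whether or not the project uses OpenCV, the structure is the same):

    import numpy as np
    import cv2

    def make_signature(rgb_points):
        # cv2.EMD takes "signatures": rows of [weight, R, G, B].
        # Equal weight per point here; real code may bin or subsample,
        # since EMD over thousands of raw pixels gets expensive.
        n = len(rgb_points)
        weights = np.full((n, 1), 1.0 / n, dtype=np.float32)
        return np.hstack([weights, rgb_points.astype(np.float32)])

    def emd_between(points_a, points_b):
        sig_a, sig_b = make_signature(points_a), make_signature(points_b)
        # DIST_L2 = the Euclidean RGB ground distance described above
        dist, _, _ = cv2.EMD(sig_a, sig_b, cv2.DIST_L2)
        return dist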
5. Output the attribute vector. Using the manual label file and the outputted area file, we can derive which ingredients exist in the image and their corresponding areas. We discard the clusters labeled -1 in the manual file (recall that -1 means there is no ingredient there), and we add up the areas of all clusters that do have an ingredient => total food area. Then for each ingredient, (ingredient area / total food area) produces a fraction, which goes into the attribute vector.
The attribute vector is an n-by-1 vector, where n = number of ingredients, in this case 16. Each row is the fraction of the plate's area that the ingredient occupies.
In English, this means the vector tells us what ingredients exist on that plate and how much of each there is relative to the others, i.e. what portion of the dish each ingredient makes up.
For this image, it happens to be:
0.000000
0.000000
0.000000
0.000000
0.000000
0.156769
0.000000
0.000000
0.234776
0.000000
0.608455
0.000000
0.000000
0.000000
0.000000
0.000000
All the non-zero rows should add up to 1, which they do.
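Steps 4.i and 5 together in code (variable names are mine; the numbers are the example image's):

    import numpy as np

    NUM_INGREDIENTS = 16

    def attribute_vector(labels, areas):
        # labels[c] = ingredient ID of cluster c (-1 = no ingredient)
        # areas[c]  = pixel count of cluster c
        attr = np.zeros(NUM_INGREDIENTS)
        total = sum(a for lab, a in zip(labels, areas) if lab != -1)
        for lab, a in zip(labels, areas):
            if lab != -1:
                attr[lab] += a / total
        return attr

    attr = attribute_vector([-1, 10, 8, 5, -1],
                            [1754, 29016, 11196, 7476, 6558])
    # attr[5] ≈ 0.156769, attr[8] ≈ 0.234776, attr[10] ≈ 0.608455,
    # and the non-zero entries sum to 1, matching the vector above.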
6. Train attribute classifier (layer 1 of the attribute-based classification) for each ingredient (i.e. the "attribute").
For each ingredient, all the images containing that ingredient form the positive data, where the area count is non-zero, and all the remaining images (which do not contain that ingredient) form the negative data, where the area count is 0.
For each image, the training sample is the EMD vector of this ingredient in this image (the feature could be EMD, an RGB color histogram, or another low-level descriptor). The ground truth is this ingredient's area fraction.
So one row of the training matrix looks like:
| EMD data | Ground truth: area fraction |
For the example image, its rows in the training data of ingredients 10, 8, and 5 are filled in; its rows in the other ingredients' classifiers are all -1s with ground truth 0 (meaning the area is 0: those ingredients don't appear in the image).
For ingredient 10:
-1.000000 -1.000000 -1.000000 -1.000000 -1.000000 145.025467 -1.000000 -1.000000 138.852463 -1.000000 0.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000
Ground truth: 0.608455
Ingredient 8:
-1.000000 -1.000000 -1.000000 -1.000000 -1.000000 44.827419 -1.000000 -1.000000 0.000000 -1.000000 138.755737 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000
Ground truth: 0.234776
Ingredient 5:
-1.000000 -1.000000 -1.000000 -1.000000 -1.000000 0.000000 -1.000000 -1.000000 54.944424 -1.000000 104.543594 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000
Ground truth: 0.156769
For all other ingredients:
-1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000
Ground truth: 0.000000
Training is done with SVM RBF kernel regression.
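A sketch of that per-ingredient training loop, using scikit-learn's SVR as a stand-in for whatever SVM regression implementation the project actually uses (array names and shapes are my assumptions):

    import numpy as np
    from sklearn.svm import SVR

    def train_layer1(emd_rows, attr):
        # emd_rows: (num_images, 16, 16); emd_rows[i, j] is the EMD vector
        #           of ingredient j in image i (all -1s when absent)
        # attr:     (num_images, 16); attr[i, j] = area fraction of j in i
        num_ingredients = emd_rows.shape[1]
        models = []
        for j in range(num_ingredients):
            X = emd_rows[:, j, :]   # one row per image, as in the table above
            y = attr[:, j]          # ground truth: area fraction, 0 if absent
            models.append(SVR(kernel="rbf").fit(X, y))
        return models

Step 7's prediction on the test fold is then just models[j].predict(test_rows) for each ingredient j.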
7. Test on test fold, output prediction results.
8. Train the layer 2 classifier, the final cuisine classifier. (* in progress)
Training data is the attribute vector.
Ground truth is the cuisine ID, 0 for Italian, 1 for Japanese, 2 for Korean.
9. Test on test fold. (* in progress)
Testing data is a vector extracted from the prediction result outputted in step 7 above.
Ground truth is the cuisine ID.
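And a sketch of this still-in-progress layer 2, again with scikit-learn as an assumed stand-in:

    from sklearn.svm import SVC

    def train_layer2(train_attrs, cuisine_ids):
        # train_attrs: ground-truth attribute vectors, one per training image
        # cuisine_ids: 0 = Italian, 1 = Japanese, 2 = Korean
        return SVC(kernel="rbf").fit(train_attrs, cuisine_ids)

    # At test time the input is NOT the ground-truth attribute vector but
    # the one assembled from layer 1's predictions (step 7):
    # predicted_cuisines = layer2_model.predict(predicted_attrs)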