Discover how to recreate a Sudoku-solving app using browser APIs and ecosystem advancements, and explore the image processing pipeline that extracts, recognises, and solves Sudoku puzzles.

[0:00] Hey Everyone!

[0:01] A long time ago I wrote an app for the iPhone that let you grab a Sudoku puzzle using

[0:08] your iPhone’s camera.

[0:10] Recently, when I was investigating my self-organising Christmas lights project, I realised

[0:16] that the browser APIs and ecosystem had advanced to the point where it was possible

[0:22] to recreate my Sudoku application running purely in the browser.

[0:27] As you can see it works pretty well - you can try it out for yourself using the links

[0:32] in the description and, as always, all the code is on GitHub for you to explore.

[0:38] Hopefully, this video will give you a good idea of how the system works and the thinking behind what I’ve done.

[0:46] Let’s take a look under the hood…

[0:49] Let’s define our problem:

[0:52] Given an image: identify and extract the Sudoku puzzle, recognise the numbers in each square

[0:59] of the puzzle, solve the puzzle, and render the results on top of the original image.

[1:06] As a classically trained image processing guy, my first port of call is to remove redundant information.

[1:14] We don’t care about colours, so we’ll convert the image to greyscale.

[1:18] As you can see we’ve not really lost any information.

[1:22] Looking at the image, we only really care about paper and ink,

[1:26] so it makes sense to binarise the image and make each pixel black or white.

[1:31] Next we need to identify the blob that is the puzzle.

[1:35] Once we’ve got that we can extract the image of the puzzle and then the contents of

[1:40] each box of the puzzle.

[1:41] Applying some OCR to the contents of each box, we can then solve the puzzle and our job is done.

[1:50] It’s a pretty straightforward image processing pipeline!

[1:54] Let’s dive into each of these steps in turn and see how they work.

[1:58] We’re taking a feed from the camera on the device.

[2:02] This comes into us as an RGB image.

[2:05] We’re not really interested in colour as we’re typically working with printed puzzles

[2:09] which will be printed in black and white.

[2:13] So our first step is to convert from RGB to greyscale.

[2:18] If we look at the standard formula for this (https://en.wikipedia.org/wiki/Grayscale) we get Y = 0.2126R + 0.7152G + 0.0722B.

[2:21] We can see that a large portion of the value is coming from the green channel.

[2:26] So we can apply a little shortcut to our RGB to greyscale conversion and just take the

[2:31] green channel of the image.
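Here’s a minimal sketch of that shortcut in TypeScript, assuming the frame arrives as ImageData from a canvas 2D context (the function name is just illustrative):

```typescript
// A minimal sketch of the green-channel shortcut, assuming the frame
// arrives as ImageData from a canvas 2D context (RGBA, 4 bytes per pixel).
function toGreyscale(imageData: ImageData): Uint8ClampedArray {
  const { width, height, data } = imageData;
  const grey = new Uint8ClampedArray(width * height);
  for (let i = 0; i < grey.length; i++) {
    // Each pixel is [R, G, B, A]; the green channel sits at offset 1.
    grey[i] = data[i * 4 + 1];
  }
  return grey;
}
```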

[2:33] We’re going to be using morphological operations to locate the puzzle.

[2:39] These typically work on black and white binary images, so our next step is to binarise our image.

[2:46] This involves applying a threshold to separate out foreground from background pixels.

[2:52] I’ve simulated an image here that has been taken in poor lighting conditions.

[2:58] As you can see from the histogram there’s not a clear separation of ink and paper.

[3:03] This makes it very hard to apply a global threshold to the whole image.

[3:09] However, if we look at a small section of the image we can see from the histogram that

[3:14] we do have an obvious collection of ink pixels and paper pixels.

[3:19] What we need to do is examine each pixel in the context of its surrounding area.

[3:24] A simple way to do this is to apply a blur to the image and then compare the original

[3:31] image with this blurred image.

[3:35] Doing this gives us a very clean segmented image even when lighting conditions are not ideal.
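A rough sketch of this adaptive threshold, assuming the greyscale buffer from the previous step; the blur radius and offset are illustrative values, not the ones used in the project:

```typescript
// Mark a pixel as ink (1) if it is noticeably darker than its local
// (blurred) neighbourhood, otherwise paper (0).
function adaptiveThreshold(
  grey: Uint8ClampedArray, width: number, height: number,
  radius = 8, offset = 10 // illustrative values
): Uint8Array {
  const blurred = boxBlur(grey, width, height, radius);
  const binary = new Uint8Array(width * height);
  for (let i = 0; i < binary.length; i++) {
    binary[i] = grey[i] < blurred[i] - offset ? 1 : 0;
  }
  return binary;
}

// Naive box blur; a real implementation would use a summed-area table
// or separable passes to avoid the nested loops.
function boxBlur(
  src: Uint8ClampedArray, width: number, height: number, radius: number
): Float32Array {
  const out = new Float32Array(width * height);
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      let sum = 0, count = 0;
      for (let dy = -radius; dy <= radius; dy++) {
        for (let dx = -radius; dx <= radius; dx++) {
          const nx = x + dx, ny = y + dy;
          if (nx >= 0 && nx < width && ny >= 0 && ny < height) {
            sum += src[ny * width + nx];
            count++;
          }
        }
      }
      out[y * width + x] = sum / count;
    }
  }
  return out;
}
```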

[3:42] We’ve now got a fairly clean image that contains our puzzle along with a bunch of random other

[3:48] printed elements.

[3:50] A classic approach would be to jump straight in here with something like a Hough transform to detect lines.

[3:56] However, we can be a bit smarter about this and take a shortcut.

[4:03] We know that whoever is taking the picture should be pointing the camera at a puzzle.

[4:07] So we can assume that the puzzle will be the largest object in the image.

[4:13] We can also apply simple heuristics to filter out elements that probably aren’t a puzzle.

[4:21] If we scan the image, pulling out all connected components,

[4:26] the connected component that contains the puzzle should be the one with the most points that also matches our heuristics.

[4:32] This lets us isolate the puzzle grid.
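A sketch of that scan, assuming the binary image from the thresholding step; real code would also apply the size and shape heuristics mentioned above:

```typescript
interface Component {
  points: number[]; // pixel indices belonging to the component
  minX: number; maxX: number; minY: number; maxY: number;
}

// Flood-fill from every unvisited ink pixel and keep the component
// with the most points.
function largestComponent(
  binary: Uint8Array, width: number, height: number
): Component | null {
  const visited = new Uint8Array(binary.length);
  let best: Component | null = null;
  for (let start = 0; start < binary.length; start++) {
    if (!binary[start] || visited[start]) continue;
    const points: number[] = [];
    const stack = [start];
    visited[start] = 1;
    let minX = width, maxX = 0, minY = height, maxY = 0;
    while (stack.length) {
      const p = stack.pop()!;
      points.push(p);
      const x = p % width, y = (p / width) | 0;
      minX = Math.min(minX, x); maxX = Math.max(maxX, x);
      minY = Math.min(minY, y); maxY = Math.max(maxY, y);
      // Visit 4-connected neighbours.
      for (const [dx, dy] of [[1, 0], [-1, 0], [0, 1], [0, -1]]) {
        const nx = x + dx, ny = y + dy;
        if (nx < 0 || nx >= width || ny < 0 || ny >= height) continue;
        const n = ny * width + nx;
        if (binary[n] && !visited[n]) { visited[n] = 1; stack.push(n); }
      }
    }
    if (!best || points.length > best.points.length) {
      best = { points, minX, maxX, minY, maxY };
    }
  }
  return best;
}
```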

[4:36] With the puzzle grid identified we now need to identify the four corners of the puzzle.

[4:43] Once again we can apply a fairly simple heuristic to this problem.

[4:47] The pixels that are in the four corners should have the smallest Manhattan distance from

[4:53] the corners of the max extents of the extracted component.

[4:58] You can see the calculation of Manhattan distance in the animations.

[5:04] Running this algorithm for all four extents of our located puzzle gives us the four corners.
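A sketch of that heuristic, reusing the Component shape from the previous sketch:

```typescript
// For each corner of the component's bounding box, find the component
// pixel with the smallest Manhattan distance to it.
function findCorners(
  c: { points: number[]; minX: number; maxX: number; minY: number; maxY: number },
  width: number
): { x: number; y: number }[] {
  const extents = [
    { x: c.minX, y: c.minY }, { x: c.maxX, y: c.minY },
    { x: c.maxX, y: c.maxY }, { x: c.minX, y: c.maxY },
  ];
  return extents.map(corner => {
    let best = { x: 0, y: 0 };
    let bestDist = Infinity;
    for (const p of c.points) {
      const x = p % width, y = (p / width) | 0;
      const dist = Math.abs(x - corner.x) + Math.abs(y - corner.y);
      if (dist < bestDist) { bestDist = dist; best = { x, y }; }
    }
    return best;
  });
}
```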

[5:13] We know where the puzzle is in our image.

[5:17] We have the location of the four corner points.

[5:21] To extract the puzzle image we can reformulate our problem into one of a homographic transform.

[5:29] We want to map from the camera image of the puzzle to an ideal image of the puzzle that

[5:35] is not distorted by perspective or rotations.

[5:40] We do this by computing a homography between the two planes.

[5:46] The formula shown can be transformed into this new formula - as you can see we need four points

[5:54] - this maps nicely onto the corner points that we’ve already located for the puzzle.

[6:02] If we rewrite the matrices algebraically we can find h using this algorithm.

[6:07] Please see the linked paper for details on how this is derived,

[6:14] and for alternative, more stable methods of calculating the homography.

[6:19] Now that we have the homography between our ideal image and the camera image we can map pixels between them.

[6:28] This lets us extract a nice square puzzle image from the camera image.
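A sketch of the four-point homography setup (the direct linear transform with h9 fixed at 1), solved here with plain Gaussian elimination; as the video notes, more stable methods exist:

```typescript
type Point = { x: number; y: number };

// Each of the four point pairs gives two rows of an 8x8 linear system
// A·h = b, with h9 fixed at 1.
function computeHomography(src: Point[], dst: Point[]): number[] {
  const A: number[][] = [];
  const b: number[] = [];
  for (let i = 0; i < 4; i++) {
    const { x, y } = src[i];
    const { x: u, y: v } = dst[i];
    A.push([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.push(u);
    A.push([0, 0, 0, x, y, 1, -v * x, -v * y]); b.push(v);
  }
  return [...solve(A, b), 1]; // h = [h1..h8, 1]
}

// Map a point through the homography, dividing by the projective scale.
function project(h: number[], p: Point): Point {
  const w = h[6] * p.x + h[7] * p.y + h[8];
  return {
    x: (h[0] * p.x + h[1] * p.y + h[2]) / w,
    y: (h[3] * p.x + h[4] * p.y + h[5]) / w,
  };
}

// Gaussian elimination with partial pivoting.
function solve(A: number[][], b: number[]): number[] {
  const n = b.length;
  for (let col = 0; col < n; col++) {
    let pivot = col;
    for (let r = col + 1; r < n; r++) {
      if (Math.abs(A[r][col]) > Math.abs(A[pivot][col])) pivot = r;
    }
    [A[col], A[pivot]] = [A[pivot], A[col]];
    [b[col], b[pivot]] = [b[pivot], b[col]];
    for (let r = col + 1; r < n; r++) {
      const f = A[r][col] / A[col][col];
      for (let c = col; c < n; c++) A[r][c] -= f * A[col][c];
      b[r] -= f * b[col];
    }
  }
  const x = new Array(n).fill(0);
  for (let r = n - 1; r >= 0; r--) {
    let sum = b[r];
    for (let c = r + 1; c < n; c++) sum -= A[r][c] * x[c];
    x[r] = sum / A[r][r];
  }
  return x;
}
```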

[6:33] Now that we have the square puzzle image we need to extract the contents of each individual cell.

[6:40] We can use the thresholded version of the image to help us with this.

[6:45] Looking at each box in turn we extract the bounds of any connected region starting from

[6:51] the center of each cell.

[6:54] We can then use these bounds to extract an image of the digit from the square greyscale image.

[7:01] If there is no connected region in the center of the cell then we know that it is empty.
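A sketch of the per-cell check, assuming a 9x9 grid on a square thresholded image of side `size`; the search-window size is an illustrative choice:

```typescript
// Look for ink near the centre of cell (row, col); if found, flood-fill
// to get the digit's bounding box, otherwise report the cell as empty.
function cellDigitBounds(
  binary: Uint8Array, size: number, row: number, col: number
): { minX: number; minY: number; maxX: number; maxY: number } | null {
  const cell = size / 9;
  const cx = Math.round((col + 0.5) * cell);
  const cy = Math.round((row + 0.5) * cell);
  const window = Math.round(cell / 4); // illustrative search window
  for (let dy = -window; dy <= window; dy++) {
    for (let dx = -window; dx <= window; dx++) {
      if (binary[(cy + dy) * size + (cx + dx)]) {
        return floodFillBounds(binary, size, cx + dx, cy + dy);
      }
    }
  }
  return null; // no ink near the centre: the cell is empty
}

// Bounding box of the 4-connected ink region containing (sx, sy).
function floodFillBounds(
  binary: Uint8Array, size: number, sx: number, sy: number
) {
  const visited = new Uint8Array(binary.length);
  const stack = [sy * size + sx];
  visited[stack[0]] = 1;
  let minX = sx, maxX = sx, minY = sy, maxY = sy;
  while (stack.length) {
    const p = stack.pop()!;
    const x = p % size, y = (p / size) | 0;
    minX = Math.min(minX, x); maxX = Math.max(maxX, x);
    minY = Math.min(minY, y); maxY = Math.max(maxY, y);
    for (const [dx, dy] of [[1, 0], [-1, 0], [0, 1], [0, -1]]) {
      const nx = x + dx, ny = y + dy;
      if (nx < 0 || nx >= size || ny < 0 || ny >= size) continue;
      const n = ny * size + nx;
      if (binary[n] && !visited[n]) { visited[n] = 1; stack.push(n); }
    }
  }
  return { minX, minY, maxX, maxY };
}
```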

[7:07] We now have an image for each populated cell of the puzzle.

[7:13] We’re going to use a neural network to perform OCR on each image.

[7:18] I’m going to use TensorFlow to train the network and then TensorFlow.js to run the network

[7:24] in the browser.

[7:25] Our neural network is going to be trained to recognise the digits 1 through 9.

[7:32] I’ve synthesized a large number of training examples by rendering

[7:38] a greyscale image of each digit in a wide selection of fonts.

[7:42] I’ve also added a small amount of blur and some Gaussian noise to each image.

[7:47] Hopefully this will provide a reasonable representation of what we will be getting in a live environment.

[7:54] I’ve also separated out 10% of the images into a testing data set that we can use to evaluate

[8:01] how well our network performs.

[8:06] I’m using an interactive Jupyter notebook to train the network.

[8:11] We’ll import TensorFlow and then set up the batch size

[8:15] and the number of epochs for our training session.

[8:19] I’m going to use a batch size of 32 and run the training for 100 epochs.

[8:26] We’re also going to resize our images to 20 by 20 pixels.

[8:32] We need to remember to also do this in the JavaScript code so that it matches.

[8:39] I’m augmenting our training data to help the network generalise.

[8:44] This will add some random mutations to our input data set.

[8:49] We’ll split out 20% of our images into a validation set for use during training, and also normalise the pixel values.

[9:00] We’ll also have a data generator that does no augmentation and only normalises the pixel values.

[9:08] We can now load the data, splitting our training images into training and validation subsets.

[9:14] Here’s a small sample of our augmented training data.

[9:25] For our problem we don’t need a very complicated model. I have built a single convolution layer

[9:31] followed by a dense hidden layer and then finally an output layer with nine outputs,

[9:38] one for each digit.

[9:40] We’ll log the training out to TensorBoard and also create a confusion matrix to show

[9:46] us which digits are causing problems.

[9:49] For training, we need to use the CategoricalCrossentropy loss function.
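The video trains the network in Python; here is a rough TensorFlow.js equivalent of the model described, where the filter count, hidden-layer size and optimizer are illustrative guesses (the 20x20 greyscale input, single convolution layer, nine outputs and categorical crossentropy come from the video):

```typescript
import * as tf from '@tensorflow/tfjs';

// One convolution layer, a dense hidden layer, and nine softmax outputs.
const model = tf.sequential({
  layers: [
    tf.layers.conv2d({
      inputShape: [20, 20, 1], // 20x20 greyscale digit images
      filters: 16,             // illustrative guess
      kernelSize: 3,
      activation: 'relu',
    }),
    tf.layers.flatten(),
    tf.layers.dense({ units: 64, activation: 'relu' }),   // illustrative guess
    tf.layers.dense({ units: 9, activation: 'softmax' }), // digits 1-9
  ],
});

model.compile({
  optimizer: 'adam', // illustrative guess
  loss: 'categoricalCrossentropy',
  metrics: ['accuracy'],
});
```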

[9:57] Running the training we can see that we have pretty good accuracy

[10:01] on both training and validation.

[10:05] Looking at the confusion matrix we can see that “1”s and “7”s are currently getting confused.

[10:15] Once the training is complete we can see that we have good performance on both the training data

[10:21] and the validation data.

[10:24] This indicates that our network is performing well on data it has seen as well as data it

[10:31] has never seen before.

[10:33] Looking at the confusion matrix there are still some issues between 1s and 7s.

[10:40] Let’s save the model and see which images it’s actually failing on.

[10:45] Looking at the failed images I’m pretty happy with the performance.

[10:49] I think the fonts that it is failing on would not be used in the real world so I’m happy

[10:54] to ignore these failures and move on with this network.

[10:58] As I’m happy with the neural network structure I’m going to train it on all the data and

[11:06] use the resulting network in my production code.

[11:12] The final training gives us really good accuracy.

[11:16] Checking to see if any images fail,

[11:19] we now see that no images are classified incorrectly.

[11:23] We’ll convert this model for use in TensorFlow.js.

[11:28] To run our model we need to perform the same image resizing and normalisations as we did

[11:34] during training and then we can simply feed our images into the model and ask it to predict

[11:41] the digits.
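A sketch of that inference step, assuming the cell images arrive as ImageData; the model URL is illustrative:

```typescript
import * as tf from '@tensorflow/tfjs';

// Resize each digit image to 20x20 and normalise pixel values to [0, 1]
// to match training, then take the most likely class for each cell.
async function recognizeDigits(cells: ImageData[]): Promise<number[]> {
  const model = await tf.loadLayersModel('model/model.json'); // illustrative URL
  return tf.tidy(() => {
    const batch = tf.stack(cells.map(cell =>
      tf.image.resizeBilinear(
        tf.browser.fromPixels(cell, 1).toFloat(), [20, 20]
      ).div(255) // normalise to [0, 1], matching training
    ));
    const predictions = model.predict(batch) as tf.Tensor;
    // argMax gives a class index 0-8; add 1 to map back to digits 1-9.
    return Array.from(predictions.argMax(-1).dataSync()).map(i => i + 1);
  });
}
```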

[11:44] We can solve a Sudoku puzzle using Donald Knuth’s application of Dancing Links to his

[11:50] Algorithm X for solving the exact cover problem.

[11:55] Dancing links uses the seemingly very simple realisation that we can remove nodes from,

[12:01] and put them back into, a doubly-linked list very efficiently.

[12:06] We can see that when we remove node B from the list it still has pointers back to the

[12:12] nodes that used to link to it.

[12:15] This means that node B contains all the information required to add it back into the list.
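A minimal sketch of that trick on a horizontal doubly-linked list:

```typescript
// Removal splices the node out of the list but leaves the node's own
// pointers intact, so restoring it is just two assignments.
class Node {
  left: Node = this;
  right: Node = this;

  remove(): void {
    // Neighbours skip over this node, but this.left/this.right still
    // point at them, so the node remembers where it came from.
    this.left.right = this.right;
    this.right.left = this.left;
  }

  restore(): void {
    // Put the node back exactly where it was.
    this.left.right = this;
    this.right.left = this;
  }
}
```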

[12:23] So what is Algorithm X?

[12:27] Algorithm X is an algorithm that solves the exact cover problem.

[12:33] The exact cover problem is solved by finding a selection of rows in the grid that

[12:37] combine so that there is exactly one 1 in each column.

[12:41] This example is taken from Wikipedia - https://en.wikipedia.org/wiki/Knuth%27s_Algorithm_X

[12:48] Looking at our grid Column 1 has the fewest entries so we select it.

[12:56] This gives us a choice of adding rows A or B to our solution.

[13:03] We try adding row A to our potential solution set.

[13:06] We can see it has an entry in columns 1, 4 and 7.

[13:10] And these columns have entries in rows A, B, C, E and F.

[13:17] Removing these columns and rows leaves us with row D.

[13:21] We now have zero entries in column 2, so we have failed to find a solution.

[13:27] We need to backtrack and try again.

[13:32] This time we try adding Row B to our solution.

[13:36] This flags columns 1 and 4 for removal, which means that rows A, B and C should also be removed.

[13:44] Doing this leaves us with rows D, E and F and columns 2, 3, 5, 6 and 7.

[13:53] Now column 5 has the lowest number of entries so we select it.

[13:59] This gives us a choice of Row D. Selecting row D flags columns 3, 5 and 6 for removal

[14:07] which means that rows D and E should also be removed.

[14:12] We now have Row F remaining.

[14:14] We include Row F in our solution which means that columns 2 and 7 are flagged for removal

[14:21] which removes row F.

[14:23] We now have an empty matrix which means we have found a solution.

[14:29] The final solution is rows B, D and F.

[14:33] And if you look in the grid you can see that combining those rows together would give you a 1 in each column.

[14:41] So, how does dancing links fit into this?

[14:45] Dancing links solves the problem of backtracking in an efficient way.

[14:50] We can encode the sparse matrix as a doubly linked circular list going left and right

[14:55] for the rows and up and down for the columns.

[14:59] Since these are now linked lists we can use the fact that we can efficiently remove and re-add

[15:04] elements to perform the column and row removals, and backtracking can now be done very cheaply.
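To make the recursion concrete, here is a compact sketch of Algorithm X over a simple set-based representation; dancing links is an efficient in-place realisation of exactly this search, splicing linked lists instead of rebuilding maps:

```typescript
// `rows` maps a row name to the set of columns it covers.
function algorithmX(
  rows: Map<string, Set<number>>, columns: Set<number>, solution: string[] = []
): string[] | null {
  if (columns.size === 0) return solution; // empty matrix: solved
  // Choose the column with the fewest candidate rows (Knuth's heuristic).
  let bestCol = -1, bestCount = Infinity;
  for (const col of columns) {
    let count = 0;
    for (const cols of rows.values()) if (cols.has(col)) count++;
    if (count < bestCount) { bestCount = count; bestCol = col; }
  }
  if (bestCount === 0) return null; // a column with no rows: backtrack
  for (const [name, cols] of rows) {
    if (!cols.has(bestCol)) continue;
    // Cover: drop every column this row satisfies, and every row that
    // clashes with it.
    const remainingCols = new Set([...columns].filter(c => !cols.has(c)));
    const remainingRows = new Map(
      [...rows].filter(([, rc]) => [...rc].every(c => !cols.has(c)))
    );
    const result = algorithmX(remainingRows, remainingCols, [...solution, name]);
    if (result) return result;
    // Backtracking here just discards the copies; dancing links instead
    // restores the original lists in place.
  }
  return null;
}
```

Running it on the Wikipedia example from earlier finds the same solution, rows B, D and F:

```typescript
const rows = new Map<string, Set<number>>([
  ['A', new Set([1, 4, 7])],
  ['B', new Set([1, 4])],
  ['C', new Set([4, 5, 7])],
  ['D', new Set([3, 5, 6])],
  ['E', new Set([2, 3, 6, 7])],
  ['F', new Set([2, 7])],
]);
console.log(algorithmX(rows, new Set([1, 2, 3, 4, 5, 6, 7]))); // ['B', 'D', 'F']
```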

[15:10] This is what a Sudoku puzzle looks like when it’s been turned into a set of constraints.

[15:17] You can see that as we add the known numbers,

[15:20] the number of rows and columns decreases.

[15:23] As we run the search algorithm you can see how quickly the search space is reduced.

[15:29] You can also see that it occasionally needs to backtrack before we find the complete solution.
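A sketch of that encoding: each candidate "digit d at (row, column)" becomes a matrix row covering four constraint columns, giving 729 rows and 324 columns for an empty grid, and the givens simply restrict which candidates we generate. The row-name format is illustrative:

```typescript
// Build exact cover rows for a puzzle; `givens` is a 9x9 grid with 0 for
// empty cells. Feeding these rows (and columns 0-323) into the Algorithm X
// sketch above yields the solved grid from the row names in the solution.
function sudokuRows(givens: number[][]): Map<string, Set<number>> {
  const rows = new Map<string, Set<number>>();
  for (let r = 0; r < 9; r++) {
    for (let c = 0; c < 9; c++) {
      for (let d = 1; d <= 9; d++) {
        // Skip candidates that contradict a given digit.
        if (givens[r][c] !== 0 && givens[r][c] !== d) continue;
        const box = Math.floor(r / 3) * 3 + Math.floor(c / 3);
        rows.set(`R${r}C${c}#${d}`, new Set([
          r * 9 + c,               // constraint: cell (r, c) is filled
          81 + r * 9 + (d - 1),    // constraint: row r contains digit d
          162 + c * 9 + (d - 1),   // constraint: column c contains digit d
          243 + box * 9 + (d - 1), // constraint: box contains digit d
        ]));
      }
    }
  }
  return rows;
}
```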

[15:36] To render our results we could draw our text into a square image and then project each

[15:42] pixel onto the image coming from the camera.

[15:45] This would be quite computationally intensive.

[15:49] Alternatively, for each puzzle cell, we could take an imaginary line through the centre of the cell.

[15:55] We can then project that using our homographic transform and use this projected line to calculate

[16:01] an approximate height and angle for the cell.

[16:04] We then simply draw the digit at the projected location of the cell with the calculated height and angle.
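A sketch of that drawing step on a canvas 2D context, assuming a `project` helper like the one in the homography sketch earlier (with h mapping ideal puzzle coordinates into camera space); the font-scaling factor is an illustrative choice:

```typescript
// Project a vertical line through the cell centre into camera space,
// derive an approximate height and angle from it, and draw the digit
// with a rotated canvas transform.
function drawDigit(
  ctx: CanvasRenderingContext2D, h: number[],
  row: number, col: number, digit: number, cellSize: number
) {
  // Endpoints of a vertical line through the cell centre in the ideal image.
  const cx = (col + 0.5) * cellSize;
  const top = project(h, { x: cx, y: row * cellSize });
  const bottom = project(h, { x: cx, y: (row + 1) * cellSize });

  const height = Math.hypot(bottom.x - top.x, bottom.y - top.y);
  const angle = Math.atan2(bottom.x - top.x, bottom.y - top.y); // lean of the cell

  ctx.save();
  ctx.translate((top.x + bottom.x) / 2, (top.y + bottom.y) / 2);
  ctx.rotate(-angle);
  ctx.font = `${height * 0.8}px sans-serif`; // illustrative scaling
  ctx.textAlign = 'center';
  ctx.textBaseline = 'middle';
  ctx.fillText(String(digit), 0, 0);
  ctx.restore();
}
```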

[16:14] So that’s it for this video.

[16:16] All the code for this project is on GitHub - check the video description for the link.

[16:24] I hope you’ve enjoyed this video - please hit the subscribe button and leave any thoughts

[16:29] you might have in the comments.

[16:31] Thanks for watching and I’ll see you in the next video!