Gaussian mixture models (GMMs) are a class of probabilistic unsupervised learning algorithms. A GMM assumes that the process that generated the data can be described by a finite number of Gaussian distributions with unknown parameters; the learning problem is then to find those parameters. This can be done via maximum likelihood estimation (MLE). The idea of MLE is to maximize a likelihood function so that, under the assumed model, the observed data is most probable. From a Bayesian standpoint, MLE is equivalent to maximum a posteriori (MAP) estimation with a uniform prior.

The simplest algorithm for doing MLE iteratively is expectation maximization (EM). It consists of two steps: the expectation (E) step and the maximization (M) step. In the E step, the expected value of the log likelihood function is calculated. The M step then chooses parameters to maximize the function calculated in the E step. This process then repeats until convergence.
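Concretely, for a K-component mixture the quantity being maximized is the log likelihood (Bishop eq. 9.14), and the E step evaluates the responsibilities (eq. 9.23):

```
\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right)

\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}
```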

More detail on the theory behind these can be found in Bishop’s *Pattern Recognition and Machine Learning*, whose pseudocode I try to follow as closely as I can.

I tried to keep the implementation as straightforward as possible.

```
import numpy as np
from scipy.stats import multivariate_normal

def GaussianMixtureModel(X, k = 2, max_iters = 10000, eps = 1e-3):
    # I followed the pseudocode on Bishop pp. 438-439 as closely as I could
    # select initial means randomly
    means = X[np.random.randint(0, len(X), k)]
    # initialize responsibilities
    respons = np.zeros((X.shape[0], k))
    # initialize mixing coeffs pi_k
    mix = np.ones(k)
    # hardest part: getting the correct shape here
    # I used the shape returned by sklearn.mixture._estimate_gaussian_covariances_full,
    # which does an initial M step for the cov as well, but I didn't
    covs = np.full((k, X.shape[1], X.shape[1]), np.cov(X.T))
    log_likelihoods = [0]
    for i in range(max_iters):
        # E step
        llh, respons = E(X, k, mix, means, covs, respons)
        # M step
        mix, means, covs, respons = M(X, k, mix, means, covs, respons)
        # stop when the likelihood doesn't change by more than eps
        if np.abs(llh - log_likelihoods[-1]) <= eps:
            return means, respons, mix, covs, log_likelihoods
        # note down log likelihoods for the learning curve
        log_likelihoods.append(llh)
    print("Max Iterations Exceeded")
    return means, respons, mix, covs, log_likelihoods

def log_likelihood(respons):
    # Bishop 9.28, computed from the unnormalized responsibilities
    ret = 0
    for i in range(len(respons)):
        ret += np.log(np.sum(respons[i]))
    return ret

def E(X, K, mix, means, covs, respons):
    # calc the llh and re-evaluate the responsibilities
    for k in range(K):
        # Bishop 9.23
        # easier to split up the fraction and normalize later
        respons[:, k] = mix[k] * multivariate_normal(means[k], covs[k]).pdf(X)
    llh = log_likelihood(respons)
    respons /= respons.sum(axis = 1, keepdims = True)
    return llh, respons

def M(X, K, mix, means, covs, respons):
    # re-estimate params using current responsibilities
    # define N_k as the columnwise sum of responsibilities
    # Bishop 9.27
    N_k = respons.sum(axis = 0)
    # Bishop 9.24; divide each row of the k x d matrix by its own N_k
    means = respons.T @ X / N_k[:, None]
    for k in range(K):  # recalc covariances
        # Bishop 9.25
        diff = X - means[k]
        covs[k] = (respons[:, k] * diff.T) @ diff / N_k[k]
    mix = N_k / len(respons)  # Bishop 9.26
    return mix, means, covs, respons
```

Now let’s test our code on some generated data.

```
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

mu_x = np.array([5, 10])
mu_z = np.array([6, 11])
sig_x = np.zeros(2) + 0.01  # 1-D cov arrays are treated as diagonal
sig_z = np.zeros(2) + 0.05
X = multivariate_normal.rvs(mu_x, sig_x, size = 100)
Z = multivariate_normal.rvs(mu_z, sig_z, size = 100)
plt.scatter(*X.T)
plt.scatter(*Z.T)
```

We’ve generated two different clusters, each with a different underlying distribution. Let’s try to run our model and see if it can correctly find the parameters of the distributions.

```
dat = np.vstack((X,Z))
means, respons, weights, covs, log_likelihoods = GaussianMixtureModel(dat)
print(means)
print(covs)
```

```
[[ 4.99037355  9.99691058]
 [ 6.03674847 11.03166029]]

[[[ 0.01063869 -0.00127132]
  [-0.00127132  0.0099645 ]]

 [[ 0.05877937  0.00100672]
  [ 0.00100672  0.05401486]]]
```

We appear to be quite close. Let’s visualize the 1,2,3 sigma contours.

```
def plt_contours(X, preds, means, covs,
                 minx = -500, maxx = 800,
                 miny = -500, maxy = 5000):
    t = np.linspace(minx, maxx, 100)
    h = np.linspace(miny, maxy, 100)
    w_0, w_1 = np.meshgrid(t, h)
    z = multivariate_normal(means[0], covs[0]).pdf(np.dstack((w_0, w_1)))
    plt.contour(t, h, z, levels = 3)
    z = multivariate_normal(means[1], covs[1]).pdf(np.dstack((w_0, w_1)))
    plt.contour(t, h, z, levels = 3)
    plt.scatter(*X.T, c = preds)
    plt.show()

preds = np.argmax(respons, axis = 1)
plt_contours(dat, preds, means, covs,
             minx = 4, maxx = 7,
             miny = 9, maxy = 12)
```

As you can see, the contour bands are about where we could expect, as we know the variance in the purple cluster is quite small compared to the yellow cluster. I’d call this a success!
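One caveat: `plt.contour(..., levels = 3)` lets matplotlib pick three contour levels automatically, which only roughly corresponds to sigma bands. To draw true 1σ/2σ/3σ contours, one could pass explicit density levels, using the fact that a Gaussian's density at Mahalanobis distance r is the peak density times exp(−r²/2). A sketch, with hypothetical fitted parameters standing in for the real ones:

```
import numpy as np
from scipy.stats import multivariate_normal

mean = np.array([5.0, 10.0])                # hypothetical fitted mean
cov = np.array([[0.01, 0.0], [0.0, 0.01]])  # hypothetical fitted covariance
rv = multivariate_normal(mean, cov)

peak = rv.pdf(mean)
# density at Mahalanobis distance r is peak * exp(-r^2 / 2);
# contour levels must be passed in increasing order, so go 3, 2, 1
levels = [peak * np.exp(-r**2 / 2) for r in (3, 2, 1)]
# plt.contour(t, h, z, levels = levels) would then draw the sigma bands
```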

I’m sure the code could be made a bit faster, but sklearn’s implementation exists for production, so I opted for clarity. Anyway, thanks for reading and hopefully this toy implementation/example was useful.
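For reference, a sanity check against sklearn on data resembling the clusters above might look like this (the seed and tolerance are arbitrary choices, and the cluster parameters mirror the generated data only approximately):

```
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# two well-separated clusters similar in spirit to the generated data above
X = np.vstack([rng.normal([5, 10], 0.1, size=(100, 2)),
               rng.normal([6, 11], 0.22, size=(100, 2))])

gm = GaussianMixture(n_components=2, covariance_type="full",
                     random_state=0).fit(X)
# sort components by first coordinate for a stable comparison
means = gm.means_[np.argsort(gm.means_[:, 0])]
resp = gm.predict_proba(X)  # analogous to the responsibilities above
```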

This article presents a notable advance toward a new method of increasing single-axis-tracking photovoltaic (PV) system power output by improving the determination and near-term prediction of the optimum module tilt angle. The tilt angle of the plane receiving the greatest total irradiance changes with Sun position and atmospheric conditions, including cloud formation and movement, aerosols, and particulate loading, as well as varying albedo within a module’s field of view. In this article, we present a multi-input convolutional neural network that can create a profile of plane-of-array irradiance versus surface tilt angle over a full 180° arc from horizon to horizon. As input, the neural network uses the calculated solar position and clear-sky irradiance values, along with sky images. The target irradiance values are provided by the multiplanar irradiance sensor (MPIS). To account for varying irradiance conditions, the MPIS signal is normalized by the theoretical clear-sky global horizontal irradiance. Using this information, the neural network outputs an N-dimensional vector, where N is the number of points used to approximate the MPIS curve via Fourier resampling. The output vector of the model is smoothed with a Gaussian kernel to account for error in the downsampling and subsequent upsampling steps, as well as to smooth the unconstrained output of the model. These profiles may be used to perform near-term prediction of angular irradiance, which can then inform the movement of a PV tracker.

The field of active learning has many different approaches. This section focuses on the Query-by-Committee (QbC) framework, which uses ensembling methods to find the best sample for which to query the oracle for a label. There are generally two parts to this approach. The first is to construct and train a model ensemble. Two methods are implemented in this work: bagging and boosting. Bagging has the advantage of simplicity, but boosting often gives a larger performance increase. The second part is finding the optimal example with which to query the oracle. This is done by finding the maximum “disagreement” among the classifiers, via a variety of methods including entropy and KL divergence. Overall, the QbC method achieves accuracy comparable to or greater than that of a classifier trained on the whole dataset, but with a vastly reduced number of required samples. This work proposes a new QbC framework called jackAL, based on the jackknife; this method offers an advantage over the others because it allows the model to make the most of small quantities of data, which is often the case when active learning is required. A variation on the jackknife, jackknife-k, is explored as well.

The first two algorithms implemented in this work come from Abe and Mamitsuka, 1998 [5], and also show up in Milidiú et al., 2012 [6] for natural language processing tasks and Körner et al. [7] for the multiclass case. First, we give an overview of the paper; then, we detail our implementation of the QBag and QBoost algorithms. The paper begins by discussing the general concept of QbC. The theory behind QbC is that an ensemble of an ideal randomized learning algorithm will choose a query point that maximizes the information gained. However, this assumes that the base learner is the Gibbs learner, which is intractable. Therefore, the paper presents two committee construction methods, Query-by-Bagging (QBag) and Query-by-Boosting (QBoost), instead.

The paper then discusses the pros and cons of using QBag and QBoost. Namely, QBag is simpler, and can isolate and minimize the variance in the data (but not the bias). QBag does this by subsampling from an identical distribution, meaning that it can minimize the small statistical variation in the data over many iterations. With QBoost, on the other hand, the sampling distribution itself can change as well, depending on the ensemble. Both algorithms follow a similar structure. This is summarized in Algorithm 1, which is a paraphrase of the generalized QbC algorithm provided in the paper.

The model predicts new examples via simple majority vote. The paper proceeds to test and compare the methods on a variety of datasets, and finds QBoost to be the better algorithm. We implemented both QBag and QBoost, and found both to be fairly straightforward modifications of Algorithm 1. Note that the original authors chose to use decision trees (C4.5) as the base algorithm, whereas this work uses multinomial naive Bayes; this choice was made for consistency across the other sections of this report. C4.5 could easily be dropped in, as the sklearn framework is very modular. Additionally, QBag and QBoost use a majority vote/average approach for disagreement, whereas we use the entropy approach discussed above.
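The entropy-based disagreement mentioned above can be sketched in a few lines: given each committee member's vote on each candidate, we query the sample whose vote distribution has maximum entropy. The function and variable names here are our own, not from the papers:

```
import numpy as np

def vote_entropy(votes, n_classes):
    # votes: (n_models, n_samples) array of predicted class labels
    n_models, n_samples = votes.shape
    ent = np.zeros(n_samples)
    for j in range(n_samples):
        # empirical distribution of the committee's votes for sample j
        p = np.bincount(votes[:, j], minlength=n_classes) / n_models
        p = p[p > 0]  # 0 * log(0) is taken as 0
        ent[j] = -np.sum(p * np.log(p))
    return ent

# the best query is the sample the committee disagrees on most
votes = np.array([[0, 1, 1],
                  [0, 1, 0],
                  [0, 0, 1]])
best = np.argmax(vote_entropy(votes, 2))
```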

Another difference from the original implementation is that we chose not to use cross-validation, due to both performance requirements and, again, consistency across the report. The authors also tried a variety of test subsets (as joint and disjoint sets from the query and training data) in order to better simulate a true active learning scenario; in such a scenario, not all labels are known a priori, so the accuracy evaluated at each step across the test or the test + training data would actually be unknown. For simplicity, we evaluate the algorithms using the whole dataset; this makes sense because we wish to see how quickly each converges to the expected value, namely a model trained on the whole dataset from the start. The authors also conclude that the time complexity of a QbC model is O(NTR · F(N)), where F is the learning algorithm, N the input size, T the number of times the resampling is done, and R the number of query candidates. This is noted by Abe and Mamitsuka to be significant but not intractable. They additionally note that QBag is trivially parallel, although QBoost is not.

QBag and QBoost have the advantage of simplicity, but they are certainly not the only ensembling methods. The third algorithm implemented in this work, DECORATE, is from Melville and Mooney 2004 [18], and is adapted into a QbC method called ACTIVE-DECORATE in Melville and Mooney 2005 [19]. The authors of this paper seek to create an ensemble that is specifically very diverse. However, this is not done with resampling, as in QBag and QBoost; instead, the authors propose inserting artificially generated data into the training corpus in the hope that this will cause further variation among the models in the ensemble.

The algorithm begins by training a base classifier on the whole training set. Then, for i iterations, artificial training data is generated based on the distributions of the training set. The authors suggest the size of the artificial set be between 50% and 100% of the training set. Then, a model is trained on the union of the training and artificial sets. If adding it gives the ensemble a better overall accuracy on the training set, the model is accepted into the ensemble; otherwise, it is rejected. This process continues for a set number of iterations or until the desired number of models has been obtained. New examples are classified by taking the average of the predicted probabilities of each model and taking the maximum, as in Equation 1.
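A rough sketch of this loop, in our own simplified form: we use Gaussian naive Bayes rather than the paper's decision trees, draw artificial points from per-feature Gaussians, and label them with the ensemble's least-probable class as a crude stand-in for the paper's inverse-probability labeling. All names are ours, not the paper's:

```
import numpy as np
from sklearn.naive_bayes import GaussianNB

def ensemble_predict(models, X):
    # average the predicted class probabilities across the ensemble
    probs = np.mean([m.predict_proba(X) for m in models], axis=0)
    return models[0].classes_[np.argmax(probs, axis=1)]

def decorate(X, y, n_models=5, max_iters=20, art_frac=0.5, seed=None):
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    ensemble = [GaussianNB().fit(X, y)]  # base classifier on the whole set
    best_acc = np.mean(ensemble_predict(ensemble, X) == y)
    n_art = max(1, int(art_frac * len(X)))
    for _ in range(max_iters):
        if len(ensemble) >= n_models:
            break
        # draw artificial points from per-feature Gaussians fit to the data
        X_art = rng.normal(X.mean(axis=0), X.std(axis=0) + 1e-9,
                           size=(n_art, X.shape[1]))
        # label artificial points with the ensemble's least-probable class
        probs = np.mean([m.predict_proba(X_art) for m in ensemble], axis=0)
        y_art = classes[np.argmin(probs, axis=1)]
        candidate = GaussianNB().fit(np.vstack([X, X_art]),
                                     np.concatenate([y, y_art]))
        # accept only if training accuracy does not drop
        trial = ensemble + [candidate]
        acc = np.mean(ensemble_predict(trial, X) == y)
        if acc >= best_acc:
            ensemble, best_acc = trial, acc
    return ensemble
```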

In their further work, the authors expanded the DECORATE algorithm into an active learning method. The authors argue that ACTIVE-DECORATE will work better because DECORATE purposefully generates diverse ensembles; there will therefore be more diversity in their predictions, making the votes of the ensemble less homogeneous. This increase in disagreement allows the active learner to get a better understanding of what the current training corpus lacks, thereby choosing better examples earlier than a resampling method like QBag or QBoost. Although the authors do not give an example of why, we posit that when the sample size is small, at the very beginning, QBag and QBoost both have a tendency to select redundant examples, and if the classes drawn by the subset are unbalanced, this can cause failure. For example, if the current training corpus is of length 5, 4 of the 5 are positive, and the possible query options are all of the omitted class, then the model will have a tough time making a decision.

As with QBag and QBoost, the authors of ACTIVE-DECORATE suggest a decision tree algorithm be used as the base learner, whereas we again implemented multinomial naive Bayes. An interesting claim to note is that since naive Bayes is seen as a “stable” learning algorithm, it supposedly does not minimize the variance due to statistical noise as well as unstable learners do. This may factor into the performance of our ACTIVE-DECORATE. The authors also claim that DECORATE is on par with AdaBoost; we do not find this to be universally true, but we also do not perform a rigorous comparison between the two, as we focus on the active learning variant instead. The authors also explicitly note that boosting can fail with small sample sizes; this can be seen in our implementation as well, in the Results section.

The authors offer two hypotheses for why their ACTIVE-DECORATE method is optimal. The first is that ACTIVE-DECORATE explicitly fosters diversity, which the authors believe to be the critical component. We do not take so strong a stance, as evidenced by our novel extension, jackAL; the learners in a jackAL ensemble are all fairly similar. However, ACTIVE-DECORATE does boast the advantage of using the entire training set each time, as it relies on generating random training examples. The authors also posit that DECORATE will out-perform bagging, AdaBoost, and random forests on small datasets; although this may be true for the pure algorithm, we do not find it to always be the case for the active learning variants. Melville and Mooney ran a test suite very similar to the one used for QBag and QBoost as well. Additionally, they use a different disagreement metric, JS-divergence, which is a measure of similarity between probability distributions, as well as the majority vote metric from Abe and Mamitsuka. They find that JS-divergence chooses examples that strengthen the model’s certainty, but the majority vote method maximizes the margin (the entropy minimization procedure we use reduces to this method in the binary classification case [1]).

As mentioned at the beginning, the jackknife is an early resampling method that involves taking leave-one-out samples of the training data. We can then build ensembles from these subsets. There are a few things to note about this process. Obviously, using the jackknife will result in a very large ensemble (n models, each trained on n−1 examples), which requires training many models. Thus, this procedure is very computationally expensive, and gets more so as the size of the dataset increases. In return, the jackknife’s many classifiers will reduce the impact of the statistical noise in the data quite a bit, as the effect of each and every point is considered. Additionally, the large number of subsets offered by the jackknife makes it suitable for when the sample size is small, such as in the early stages of the active learning problem. However, as the sample size increases, the jackknife becomes less and less feasible and useful, as the removed example has less and less of an effect. A modification of the jackknife, jackknife-k, is discussed in Wu 1986 [8]; it simply deletes k elements of the set, rather than 1. Note that this method is analogous to bagging with many bags, which we know to be functional. However, in QBag, the number of bags is held constant. Jackknife-k provides us with an opportunity to vary k as a function of the size of the training subset. This allows us to reap the benefits of leaving one sample out when the sample size is very small, while increasing k to have more of an effect on the ensemble when the number of samples is larger. Thus, we propose a new algorithm: jackAL (jackknife Active Learning), which implements jackknife-k where k obeys some schedule.
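As a sketch (names and defaults our own), generating jackknife-k index subsets might look like the following; with k = 1 and one subset per data point this resembles the classic leave-one-out jackknife, except that the left-out elements here are chosen randomly rather than exhaustively:

```
import numpy as np

def jackknife_k_subsets(n, k, n_subsets=None, seed=None):
    # each subset leaves k randomly chosen elements out of range(n)
    rng = np.random.default_rng(seed)
    if n_subsets is None:
        n_subsets = n
    idx = np.arange(n)
    return [np.delete(idx, rng.choice(n, size=k, replace=False))
            for _ in range(n_subsets)]

subsets = jackknife_k_subsets(10, 2, seed=0)
```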

```
Algorithm 2: jackAL
Input: X, the training data
       k, the number of elements to delete, or
       schedule, a function of |X| that returns a value for k
       n, the number of samples allowed to be queried
let the training corpus be a set containing one example
for i < n; i++
    run jackknife-k(X, k, schedule) on the current corpus
    calculate the disagreement of the ensemble
    query the best point to add to the corpus via disagreement
return the current models
```

Prediction is done the same way as for ACTIVE-DECORATE. The choice of the schedule is still under testing. However, we hypothesize that monotonically increasing functions that “level out” at around 10-20% of the size of the dataset would behave optimally, as a large k in the smaller cases could create poor ensembles, and a too-small k when the dataset is large will likely have little effect other than smoothing statistical noise, which we are not especially interested in at that stage. As of now, the log10 function is chosen as the default behaviour. Note that this procedure could be reframed as a modified QBag function, where the number of bags is variable as a function of the input size. We choose to stick with a jackknife framing because we find that a small value of k is more optimal in the early regime of active learning with few samples, and this procedure is more analogous to the jackknife-1 than to a bagging procedure with many bags. Additionally, the choice of Gaussian naive Bayes (GNB) as the learner for continuous features has not been robustly tested. As GNB tends to return probabilities very close to 0 or 1, it is perhaps not the optimal choice for a model like jackAL or ACTIVE-DECORATE. A better choice might be logistic regression, as it returns more “smoothly” distributed probabilities. As an aside, the predict_proba function from sklearn, which was used to obtain the probabilities, is known to be rather unreliable for models like GNB as well. We feel that jackAL works because it a) effectively minimizes statistical noise and b) chooses the right query points in the case where little data is available. Statistical noise is minimized because of the leave-k-out properties of jackknife-k; this enables models to reduce this pesky noise rapidly.
Note that this hypothesis conflicts with the belief of the authors of ACTIVE-DECORATE; since the individual |n−k| subsamples will all be quite similar, there will not necessarily be much diversity between the individual models. We hypothesize that this differing approach causes jackAL to be more stable than ACTIVE-DECORATE, in the sense that the overall accuracy does not “bounce” around during training, which can be seen in the Results section. However, ACTIVE-DECORATE may cause higher “spikes” of intermittent accuracy as well, in contrast to the slow-but-steady jackAL. We also posit that jackAL is a good choice because it chooses well with a small dataset. This makes sense; as jackAL constructs many models, it may be able to “squeeze” every drop of information out of a very small dataset. We additionally believe that this strong-start behavior is very beneficial; it is analogous to the model picking up some form of momentum, although we have no way to empirically prove this or the earlier hypothesis. Nevertheless, we do find jackAL to be a competitive active learning method.
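The log10 default mentioned above can be sketched as a schedule function; the exact clamping constants here are our own illustrative choices, not tested values:

```
import numpy as np

def k_schedule(n):
    # k grows logarithmically with corpus size n, is always at least 1,
    # and is clamped so it never exceeds ~20% of the dataset
    k = int(np.log10(max(n, 1))) + 1
    return max(1, min(k, max(1, int(0.2 * n))))
```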

In summary, the methods implemented from the literature for this study are QBag, QBoost, DECORATE, and ACTIVE-DECORATE. Additionally, the novel extension jackAL was implemented as well, in the original jackknife, jackknife-k, and scheduled jackknife-k forms. All code is written in Python 3.7. The model chosen to test the ensembles is naive Bayes, using a multinomial distribution for categorical values and a Gaussian distribution for continuous values. The particular implementation used is from the scikit-learn library; our pure Python implementation proved too slow, especially in the case of ensembling, when the model must be retrained many times. The entropy function from scipy was additionally used for the disagreement metric. The NumPy library was used extensively to speed up computation. Additionally, an optimization was made that applies to all algorithms: instead of looping through the entire query set, a smaller subset was sampled with replacement according to a uniform distribution. This optimization was suggested by Abe and Mamitsuka and implemented across the board for comparison’s sake. The size of this sample set is a parameter on all functions. This optimization is a significant speedup, and appears to have minimal impact on performance. This may be dependent on the data, as it inherently assumes that the optimal query examples are mostly non-unique; that is, the samples form clusters of optimal examples, and we seek to select an arbitrary sample from such a cluster. This concept is further explored in the Density Weighted Learning section of this report. All other code written for this work is the sole work of the authors; no references other than the original papers and those previously mentioned were used. In the future, all methods would experience a massive speedup from the introduction of parallelism; the problem is trivially parallel in some places, such as when training the ensemble models in jackAL.
Additionally, the code could be further optimized in a few places. The experimentation and code to generate figures for this report can be found in a Jupyter notebook, and the individual functions are all separated into a .py file.

The results of this experiment were quite promising. All methods of active learning performed surprisingly well; such a large speedup was not anticipated before the start of this project. The results are delineated in Figure 1. All models appear to achieve training accuracy similar to a naive Bayes model trained on the whole dataset, with vastly fewer required examples. Amongst the models, QBag and jackAL appear to be the most competitive, in contrast to the findings of the original authors of QBoost, who claimed it was better than QBag. In fact, QBoost appeared to diverge rather than converge. This strange behaviour is due to the introduction of too much randomness; if the size of the query set (resampled at every iteration) is doubled from 50 to 100, QBoost performs comparably to QBag. This example is left in to illustrate the importance of hyperparameter tuning. The jackAL method posited in this study appears to be quite good, and is very consistent after around 10 examples seen on the Volcanoes dataset, which is a discrete dataset. The second dataset tested is the breast cancer dataset included in sklearn, which is a continuous, binary classification dataset with 569 samples. Notably, QBoost appears to oscillate quite a bit, as opposed to algorithms like QBag or jackAL. All active learning models are a vast improvement over random selection, as random selection only obtains ~50% accuracy by the time the other models have converged. The voting dataset was also tested, but since random selection causes the model to converge within ~5 samples, it was deemed too easy for comparison purposes.

In conclusion, QbC appears to be a competitive, albeit computationally expensive, method. We foresee this type of learning being useful when obtaining more examples is expensive or time-consuming; in the future, we hope to apply this method to the degradation of photovoltaic modules that undergo artificial weathering. As the artificial weathering process is long and costly, it would be useful to know what kinds of samples would be most informative when putting the next batch into simulated exposure. All methods of QbC reviewed in the literature, including QBag, QBoost, and ACTIVE-DECORATE, were shown to be effective on our test datasets, and the proposed novel extension, jackAL, performed as well as or better than the known algorithms. Future work could include exploring different scheduling functions for jackAL and different disagreement functions than entropy.

```
[5] N. Abe and H. Mamitsuka, “Query Learning Strategies Using Boosting and Bagging,” Madison, Wisconsin, USA, Jan. 1998, vol. 15, pp. 1–9. Accessed: Dec. 01, 2020. [Online].
[6] R. L. Milidiú, D. Schwabe, and E. Motta, “Active Learning with Bagging for NLP Tasks,” in Advances in Computer Science, Engineering & Applications, Berlin, Heidelberg, 2012, pp. 141–147, doi: 10.1007/978-3-642-30111-7_14.
[7] C. Körner and S. Wrobel, “Multi-class Ensemble-Based Active Learning,” in Machine Learning: ECML 2006, Berlin, Heidelberg, 2006, pp. 687–694, doi: 10.1007/11871842_68.
[8] C. F. J. Wu, “Jackknife, Bootstrap and Other Resampling Methods in Regression Analysis,” Ann. Statist., vol. 14, no. 4, pp. 1261–1295, Dec. 1986, doi: 10.1214/aos/1176350142.
[9] D. Lewis and W. Gale, “A sequential algorithm for training text classifiers,” in Proc. ACM SIGIR Conference on Research and Development in Information Retrieval, 1994, pp. 3–12.
[10] D. Lewis and J. Catlett, “Heterogeneous uncertainty sampling for supervised learning,” in Proc. International Conference on Machine Learning (ICML), 1994, pp. 148–156.
```

The first time I encountered finite state machines was in a logic and design class. In this context, state machines arose naturally as a model for a simple computation. Here’s a quick example:

Problem: Design a 3-bit binary counter with counting sequence 0,1,2,3,4,5,0,1,… This seems a simple enough problem. Here’s my sloppily-scrawled solution:

If the clock ticks 1, increment the counter; once 5 has been reached, wrap back around. In this context, an FSM has been used to informally describe the structure of some circuit, but it only appears to matter for human context. Indeed, the implementation of the FSM in hardware was based more on a truth table and Karnaugh map simplification than on the structure of the FSM.
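That informal description maps directly onto a transition table; as a quick sketch in Python:

```
# transition table for the mod-6 counter: current state -> next state
next_state = {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 0}

state = 0
sequence = []
for _ in range(8):          # eight clock ticks
    sequence.append(state)
    state = next_state[state]
print(sequence)             # [0, 1, 2, 3, 4, 5, 0, 1]
```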

Thus, in this case, the FSM was a useful tool for describing the overall structure of a problem, while leaving implementation details to a formalized method. The second place I encountered FSMs is a similar story.

For my networks class, I had to understand (and write!) an implementation of the Transmission Control Protocol (TCP). TCP is used for pretty much every Web transaction where reliability is important, and is implemented at the operating system level via Unix sockets. The specification for TCP is laid out in the RFC. Although the RFC never mentions FSMs directly, it does lay out a sequence of states and state transitions for TCP, beginning on page 20. TCP can therefore be implemented as a state machine. TCP is based on the principle of reliable data transfer (rdt), which attempts to adjust for network errors such as flipped bits or dropped packets (this is what differentiates TCP from a fire-and-forget protocol like UDP). When designing an rdt protocol like TCP, it becomes useful to draw an FSM for both the send and receive side, as shown below.

Unlike my experience with circuitry, the high level of detail provided by the FSM is already quite useful; clearly, we can just place the conditions on this diagram into a series of if-statements and let the machine loop. This use case hints at how an informal model can be used to describe the structure of a program.

The next place I encountered finite state machines was in a class on theoretical computer science. Naturally, the first thing that must be done is to construct a theoretical computer. A finite state automaton is one of the simplest models we can use (in contrast to pushdown automata and Turing machines). To give a formal definition:

A finite state automaton M is a 5-tuple (Q, Σ, δ, q, F) where

- Q is a finite set of states
- Σ is a finite set called the alphabet
- δ : Q × Σ → Q is the transition function
- q is the initial state
- F is the set of accept states.

Informally, a finite state automaton is a set of states and a transition function that work on some “alphabet”. Let’s make this more real.

Imagine we have returned to the Stone Age of computing, and all we have on hand is an infinite tape for input, a “head” processing unit that can read from the tape, and a machine that threads the tape past the head. Additionally, the machine has a patient “operator” that knows what to do for every character that the head reads in from the tape. This theoretical machine is (surprisingly?) a very useful model for computation!

As it turns out, the type of problems that an FSA can solve are known as regular problems. Generally, problems are called languages, where the elements of the language are solutions to the problem. This is a useful construct throughout computability and complexity theory. Now let’s use this theory for something, that something being everyone’s favorite problem-solving method of “dude, just use a regex”. Regex, short for Regular Expression, is a much-feared method of string matching, on account of its somewhat difficult syntax. However, a landmark result in language theory is that regular expressions are actually equivalent to finite state automata! That is, a regex exists if and only if you can construct an FSA that can compute it. The proof is quite interesting, but it’s much simpler to use nondeterminism (magic!), which will hopefully be discussed in a future post. Let’s give a concrete example.
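To make the equivalence concrete, here is a sketch of a two-state DFA deciding the classic regular language "binary strings with an even number of 1s":

```
# DFA for "binary strings with an even number of 1s"
# states: 'even' (initial and accepting) and 'odd'
delta = {('even', '0'): 'even', ('even', '1'): 'odd',
         ('odd', '0'): 'odd',   ('odd', '1'): 'even'}

def accepts(s):
    state = 'even'              # initial state q
    for ch in s:
        state = delta[(state, ch)]
    return state == 'even'      # F = {'even'}

# accepts("1001") -> True (two 1s), accepts("111") -> False (three 1s)
```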

Although regex syntax can be frightening, this simplified version has only two symbols, * and implied concatenation. The * symbol simply means “any number of,” and the concatenation is just “these must be next to each other on the tape”.

So how does our idea of an FSA help us in figuring out a regex? A common technique in FSA-programming is to imagine yourself as the all-knowing “operator” who makes decisions based on what he or she reads from the tape, one character at a time. This is very helpful when considering more complex regexes that may be less trivial to write out in a few lines of English.

In order to justify my joke at the beginning (“After all, regular languages are context-free”), I’ve got to explain it. Essentially, take your FSA and add another tape; this time, we’ve made some technological progress, and can now write to this tape. This tape is a stack, meaning that it’s last-in-first-out. Other than that, it’s the same thing as the FSA. We call the languages decided by these machines context-free, and the machines themselves pushdown automata (after the stack). Clearly, our new machine is more “powerful” than our FSA, and can do anything the FSA can do (and more!). Thus, all regular languages are also context-free! I’ll save the implications of this for a later post, but now you can at least pretend to laugh.

In summary, FSA are pretty neat, and pop up in all sorts of cool places. Honestly, each of the topics covered today deserve a post all on their own. The purpose of this post was to document my “hey, I’ve seen that before” feeling, and hopefully connect a few neurons in my own head, as I’ve found relating seemingly-disparate topics is quite good for digging deeper into it. Next up is probably nondeterminism, which is probably my favorite computer science “hand wave” and also my favorite justification for forgetting to show my work.

]]>In my recent research, I’ve stumbled across the fascinating topic of hulls. For simplicity, we can define a hull to be an n-sided polygon that encloses all points p ∈ S, where S is a set of k points. The most common type of hull is the convex hull, which is the smallest convex polygon that contains S, in contrast to the more rarely seen concave hull, which we can define to be the smallest polygon (convex or not) that encloses S. The difference between the two is, of course, that a convex hull is constrained to be convex, whereas the concave hull has no such restriction.

There are actually a ton of interesting applications of hulls. One particularly novel use is described in this cool blog post, which uses an alpha shape approximation to find contours of places from geotagged data. Another common use of a hull is to postprocess the result of a clustering algorithm, such as k-means or DBSCAN. This is the application I was originally concerned with.

The easiest type of hull to use is the convex hull. There are several Python implementations of the convex hull, included in projects like scipy, OpenCV, and Shapely. The convex hull is quite intuitive, conceptually. Let’s dive into some code.

```
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from scipy.spatial import ConvexHull

# generate a single 2D blob of 100 points
X, y = make_blobs(n_samples=100, centers=1, n_features=2, random_state=0)
hull = ConvexHull(X)
# each simplex is a pair of point indices forming one edge of the hull
for simplex in hull.simplices:
    plt.plot(X[simplex, 0], X[simplex, 1], color='red')
plt.scatter(*zip(*X), marker='.', color='black')
plt.show()
```

Output:

Cool, that was easy! We’ve successfully associated some sort of area with these points, which we can compute with various methods. Scipy in particular has already done this for us, so there’s no need to do it explicitly. Mathematically, a common method is through Green’s Theorem, which you might remember from vector calculus.
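For intuition, the discrete analogue of Green’s Theorem for polygons is the shoelace formula. A minimal sketch (the function name is my own):

```python
import numpy as np

def shoelace_area(vertices):
    # vertices: (n, 2) array of polygon corners in traversal order
    x, y = vertices[:, 0], vertices[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

unit_square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]])
print(shoelace_area(unit_square))  # 1.0
```

In practice you don’t even need this: scipy’s `ConvexHull` exposes the enclosed 2D area directly as `hull.volume` (confusingly, `hull.area` is the perimeter in 2D).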

The problem is that a convex hull can be a (surprise!) bad approximation for a concave shape. Ideally, a concave hull would look more like the blue line:

This is clearly much more accurate, although still an approximation. But how do we get here?

Unfortunately for us, it ain’t as easy as the convex hull. None of the previously mentioned libraries have a built-in function for the concave hull, and for good reason. The problem is that finding a perfect concave hull is actually NP-complete. If you’re not familiar with the notion of NP-completeness, it essentially means that the problem is incredibly difficult to solve (NP stands for Nondeterministic Polynomial time, and no polynomial-time algorithm for any NP-complete problem is known) and (the cool part) if we efficiently solve one NP-complete problem, we’ve solved them all!

The standard way of showing that a problem is NP-complete is a reduction; that is, we transform a problem already known to be NP-complete into ours, so that solving ours would solve the known-hard one too. In this sense, every problem in the class is merely a rephrasing of the rest. In this case, we can fairly trivially relate finding a concave hull to a problem called the “min-area traveling salesperson”.

The traveling salesperson problem is fairly simple to describe. From Wikipedia:

“Given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city and returns to the origin city?”

We can visualize this graphically like so:

That looks like something we recognize! We’ve actually enclosed the minimal area, which was what we defined a concave hull to be earlier. Although our visual intuition is enough for the purposes of this blog post, you can find a more formal proof in papers by Fekete et al. and Asaeedi et al.

As a side note, for those of you who know quite a bit about complexity theory (and before I get emails to correct me), the traveling salesman problem is actually NP-hard, rather than NP-complete, because it’s not technically a decision problem. However, the min-area traveling salesman is, as we can reduce finding a Hamiltonian cycle in a planar digraph to it, which is NP-complete as shown by Garey et al. If I’ve still got this wrong, please let me know!

Back to the matter at hand: there’s not really an easily available package for concave hulls because it’s actually hard to compute one.

Now we delve into the world of approximations. Fortunately for us, some really smart people have already thought about this in numerous blog posts. I particularly like this one by Sean Gillies, who wrote Shapely/fiona, and this post by Kevin Dwyer, who expanded on Sean’s ideas and gave a really nice explanation.

As both of those posts provide excellent explanations, I won’t repeat them here. However, I have taken the liberty of creating a small class based on Sean’s and Kevin’s work for my own personal use to streamline the process a bit (and to generate the figure with the concave approximation above!). You can find it on my GitHub if you want to play around with it a bit.
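For reference, the core of the Delaunay-based alpha-shape idea from those posts can be sketched roughly like this (the function name and structure here are my own; the real class also handles turning the edges into polygons):

```python
import numpy as np
from collections import Counter
from scipy.spatial import Delaunay

def alpha_shape_edges(points, alpha):
    """Boundary edges of the alpha shape of an (n, 2) point array."""
    tri = Delaunay(points)
    edge_counts = Counter()
    for ia, ib, ic in tri.simplices:
        a, b, c = points[ia], points[ib], points[ic]
        # side lengths and area (Heron's formula) of this triangle
        la = np.linalg.norm(b - c)
        lb = np.linalg.norm(a - c)
        lc = np.linalg.norm(a - b)
        s = (la + lb + lc) / 2.0
        area = max(s * (s - la) * (s - lb) * (s - lc), 0.0) ** 0.5
        if area == 0.0:
            continue  # degenerate triangle
        # keep only triangles with a small enough circumradius
        if la * lb * lc / (4.0 * area) < alpha:
            for edge in ((ia, ib), (ib, ic), (ia, ic)):
                edge_counts[tuple(sorted(edge))] += 1
    # boundary edges belong to exactly one kept triangle
    return {e for e, n in edge_counts.items() if n == 1}
```

Cranking `alpha` up recovers the convex hull; shrinking it carves out the concavities (and eventually disconnects the shape entirely).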

]]>`cat`, although fast, does not handle .csv’s in any special way, and if the .csv is not short and simple, it can produce unintelligible output. I was recently diving through my organization’s .bashrc and found this handy script:
```
function pcsv() { cat "$1" | sed -e 's/,,/, ,/g' | column -s, -t | less -#5 -N -S; }
```

which results in some pretty nicely formatted output.

Thanks to whichever fed up grad student that wrote this.
