This is a very small amount of boilerplate around the golang.org/x/net/html package. If you need the huge feature set of goquery, use that. But I find this pretty suitable for my day-to-day problems.
rows := scrape.FindAll(table, scrape.ByTag(atom.Tr))
cols := []*html.Node{}
for _, row := range rows {
	// Find returns only the first matching node
	col, ok := scrape.Find(row, scrape.ByTag(atom.Td))
	if ok {
		cols = append(cols, col)
	}
}
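To make the snippet above concrete without pulling in x/net/html, here's a minimal, self-contained sketch of the depth-first traversal that Find/FindAll boil down to. The Node and Matcher types here are stand-ins invented for illustration, not the library's actual types:

```go
package main

import "fmt"

// Node is a stand-in for *html.Node (hypothetical, for illustration):
// a tag name plus children.
type Node struct {
	Tag      string
	Children []*Node
}

// Matcher mirrors the idea of scrape.Matcher: a predicate over nodes.
type Matcher func(*Node) bool

// Find does a depth-first search and returns the first match,
// with an ok flag, like scrape.Find.
func Find(n *Node, m Matcher) (*Node, bool) {
	if m(n) {
		return n, true
	}
	for _, c := range n.Children {
		if found, ok := Find(c, m); ok {
			return found, ok
		}
	}
	return nil, false
}

// FindAll collects every match in depth-first order, like scrape.FindAll.
func FindAll(n *Node, m Matcher) []*Node {
	var out []*Node
	if m(n) {
		out = append(out, n)
	}
	for _, c := range n.Children {
		out = append(out, FindAll(c, m)...)
	}
	return out
}

func main() {
	// A table > tr > td structure, mirroring the snippet above.
	table := &Node{Tag: "table", Children: []*Node{
		{Tag: "tr", Children: []*Node{{Tag: "td"}, {Tag: "td"}}},
		{Tag: "tr", Children: []*Node{{Tag: "td"}}},
	}}
	byTag := func(tag string) Matcher {
		return func(n *Node) bool { return n.Tag == tag }
	}
	rows := FindAll(table, byTag("tr"))
	fmt.Println(len(rows)) // 2
	if td, ok := Find(rows[0], byTag("td")); ok {
		fmt.Println(td.Tag) // td
	}
}
```

The whole trick is that a Matcher closure composes freely, so "by tag", "by class", and so on are all just predicates over the same recursive walk.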
> You can whip up a REST service very easily that wraps an sk-learn predictor, and I would bet it's actually much easier to do than writing PMML exporters.
So as it turns out, I spend my days building the very product you're describing (yhathq.com; a REST API-ifier for R and Python). The scikit-learn community alone is a wonderful group that does a hell of a job. It's kinda crazy that most products won't let you use that awesomeness, and instead choose to build out their own machine learning libraries to work within their systems.
This article got passed around the office this morning, and it seems to capture the general theme of most ML tools: they empower you to do cool things with machine learning and general data analysis, but at the expense of being able to use the libraries most people already rely on for that work. Don't know if I'd consider that poor design, but yeah, it's definitely a tradeoff.
Hmm, maybe I should be reaching out to airbnb's data science team?
That ML problem is more for example than for rigor. In fact, that particular problem would probably be better suited to other algorithms (e.g., random forest).
My background's in biomedical imaging, so I'm quite fond of problems with skewed class distributions. Though I didn't have time to explore this particular one further.
The code's all openly available if you want to give it a go though :)