The research in this space is very conflicting about what methods actually work. In the graph on the page, the ETS model (basically just a weighted moving average) outperforms multiple recent deep learning models. But the papers for those models claim they outperform ETS and other basic methods by quite a bit.
You can find recent papers from researchers about how their new transformer model is the best and SOTA, papers which claim transformers are garbage for time series and that their own MLP variant is SOTA, other papers which claim deep learning in general underperforms xgboost/lightgbm, etc.
Realistically I think time series is incredibly diverse, and results are going to be highly dependent on which dataset was cherry-picked for benchmarking. IMO this is why the idea of a time series foundation model is fundamentally flawed - transfer learning is the reason why foundation models work in language models, but most time series are overwhelmingly noise and don't provide enough context to figure out what information is actually transferable between different time series.
> Realistically I think time series is incredibly diverse, and results are going to be highly dependent on which dataset was cherry-picked for benchmarking
That's exactly right. It is barely one step above playing "guess which number I'm thinking of" and acting amazed that, if you play long enough, you'll witness an occasional winning streak.
My god, the model has learned to read your mind! ;)
This smacks of when very serious Soviet scientists ran telekinesis experiments and fell for all manner of cold reading and charlatans.
https://en.wikipedia.org/wiki/Telekinesis
Somebody should come up with a decoder-only foundation model for bending-spoons.
However, I can imagine a kind of meta-learning foundation model that basically has a huge internal library of micro-features, and when you put a sequence into it, it matches those features against the sequence and builds up a low-noise summary of the data that it can use to make predictions.
That's of course heavily anthropomorphized, but it seems potentially in-scope for a transformer model.
The real problem with time series data is that you can't predict the future. Images and text are relatively homogeneous and exist within a kind of restricted space. "Time series" in general however could be just about anything, and there's not as much reason to believe that something like a "grammar of time series" even exists beyond what we already can do with STL etc.
> incredibly diverse, and results are going to be highly dependent on which dataset was cherry-picked for benchmarking
This naturally leads to a multi-model solution under one umbrella: a sort of MoE, with a selector (router, classifier) and specialized experts. If there is something that can't be handled by the existing experts, then train another one.
the point is it's a fundamentally flawed assumption that you can figure out which statistical model suits an arbitrary strip of timeseries data just because you've imbibed a bunch of relatively different ones.
as long as you can evaluate the models' output you can select the best one. you probably have some idea what you are looking for, so it's possible to check how likely the output is.
the data is not a spherical horse in a vacuum. usually there is a known source which produces that data, and it's likely the same model works well on all data from that source. maybe a small number of models. which means, knowing the source, you can select the model that worked well before. even if the data is from alien ships, they are likely to be from the same civilization.
I'm not saying that it's a 100% solution, just a practical approach.
it's a practical approach for serving normal data, but monitoring systems are most valuable when they make abnormal conditions inspectable. proper modeling of a system has this power.
so while this seems persuasive, it's fundamentally about normal data, which yields little value in extrapolation.
I think the next jump will come from neurosymbolic approaches, merging timeseries with a description of what they are about as input.
You can use that description for system identification, i.e. build a model of how the "world" works. This can be translated into a two-part network architecture, one is essentially a world-model-informed (physics-informed, as its often called in literature) part for the known-unknowns, the other one is a bounded error term for the unknown unknowns (e.g. dense layers, or maybe dense layers + non-linearities that capture the fundamental modes of the problem space for reservoir computing). The world model is revised with another external cycle of meta-learning, via symbolic regression.
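A minimal sketch of the two-part idea: a world-model part for the known-unknowns plus a bounded learned error term for the rest. The dynamics, names, and constants here are purely illustrative assumptions, not from any library or from the original comment; the error term is kept deliberately simple (a linear map fit by ridge regression).

```python
import numpy as np

# Known-unknowns: a world-model step (here, a toy damped oscillator
# integrated with explicit Euler, standing in for "physics-informed").
def physics_step(state, dt=0.1, damping=0.3, omega=2.0):
    x, v = state
    return np.array([x + dt * v, v - dt * (omega**2 * x + damping * v)])

# Generate a trajectory whose true dynamics deviate slightly from the
# world model (model mismatch = the unknown-unknowns).
states, nxt = [], []
s = np.array([1.0, 0.0])
for _ in range(200):
    true_next = physics_step(s) + 0.05 * np.array([s[1], -s[0]])
    states.append(s)
    nxt.append(true_next)
    s = true_next

X = np.array(states)
# Residual target: what the physics model misses at each step
R = np.array(nxt) - np.array([physics_step(si) for si in X])
# Bounded, low-complexity error term: a linear map fit by ridge regression
W = np.linalg.solve(X.T @ X + 1e-3 * np.eye(2), X.T @ R)

def predict(state):
    # world model plus learned correction
    return physics_step(state) + state @ W
```

In a real system the error term would be trained online and its magnitude monitored; here it just demonstrates that the two-part decomposition recovers the mismatch the physics model leaves behind.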
The unknown-unknowns part I chose is designed as a shallow network that can be trained online (with traditional methods, though I'd like to see whether Hinton's forward-forward algorithm would work well for short-term online adjustments), or by well-known tools like particle filters / Kalman filters.
The non-linearities (and the overall approach in general) resemble physics-informed dynamic mode decomposition (piDMD), which shows remarkable resistance to noise (e.g. salt & pepper).
If you have simple timeseries, and not very complex hierarchical systems that change over time and show novel modes that you haven't encountered before, then piDMD is likely enough for what you need.
---
Essentially, what I describe is a multimodal model for timeseries plus a planning step. (AlphaGeometry to the rescue?)
---
Like you say, time series in general have an incredibly complicated domain with comparatively little data available.
For example, real-world complex physical systems (industrial plants, but also large-scale software systems) may have replicas of the same components with complex behavior, and no or few shared dependencies, for reliability or other constraints (e.g. physically apart). These can be captured by transformers. Training will be much faster if you initialize weights like I describe above and share weights among replicas. The physical structure also creates particular conditions on the covariance matrices and on the domain of higher-level timeseries (ultrametric spaces, which changes how measurement and frequency behave there and can lead to great simplifications, but also to errors if tools like the FFT are applied blindly without proper adjustments; much like in operations research / planning problems, symmetries are sought after to reduce complexity).
On the other hand, the next level that compose these building blocks often have graph structure and sometimes scale-free networks (e.g. if they represent usage or behaviors, rather than physical systems). I think we'll see graph neural networks shine on this front.
There are likely other kind of behavior that I haven't encountered yet in my work.
I think overall, we'll see planning/neurosymbolic approaches used at the highest layer, graph neural networks for scale-free networks and for optimizing long-range connections (also when a dense model with dynamic covariance matrices would be too expensive to compute even in sparse form), and transformers or/with piDMD-like approaches for dense patches of complex behavior. I.e. graph models as a generalization of spatial locality to arbitrary spatial-like domains, and transformers/piDMD or similar for sequence-/time-locality in arbitrary complex systems. (I wonder what kind of weights they will implement when trained together on problems that are fundamentally in the middle, where traditionally one would use wavelets... if you look at the GraphCast model by DeepMind for weather forecasting, it looks quite similar.)
"real-world complex physical systems (industrial plants, but also large-scale software systems) may have replicas of the same components with complex behavior, and no/few shared dependencies for reliability or other constraints (e.g. physically apart). These can be captured by transformers."
Could you elaborate on this please? On why the transformer architecture lends itself well to this?
The general idea is to exploit the structure of the system. Use it to pre-initialize connections (e.g. covariates, information flow as layer connections) between different parts, rather than learning them from data: make your network resemble the physical system it is modeling.
You can try pre-training a transformer to capture the behavior of a common part that is replicated, and then make replicas (sharing weights or not, depending on the problem) to train the whole ensemble. This works both for existing systems and for sufficiently high-fidelity simulations, or proxy systems that show the same range of behaviors (e.g. staging environments).
Even if the pre-trained network part doesn't converge fully or capture everything, it can pre-condition the network and help train the whole ensemble faster.
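The replica idea can be sketched very simply: one pre-trained weight matrix shared across replicas of the same component, each replica keeping only a small private trainable part. All names and shapes below are illustrative assumptions, not an actual implementation of any framework.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16

# Weights "pre-trained" once on the common component, then shared.
# (Here they are just random; in practice they would come from
# pre-training on one replica or on a simulation.)
W_shared = rng.normal(size=(d, d)) / np.sqrt(d)

class Replica:
    """One instance of the replicated component: shared core weights,
    plus a small per-replica trainable correction (here, just a bias)."""
    def __init__(self):
        self.bias = np.zeros(d)  # per-replica part, starts at zero

    def forward(self, x):
        return np.tanh(x @ W_shared + self.bias)

replicas = [Replica() for _ in range(3)]
x = rng.normal(size=(4, d))
outs = [r.forward(x) for r in replicas]  # identical until biases diverge
```

Because every replica starts from the shared weights, the ensemble begins training from a sensible point instead of from scratch, which is the pre-conditioning effect described above.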
---
For simple components, you can even just write your own simulations as a custom NN layer (e.g. recurrent layers that take the current state and input, and return the output and next state). This helps avoid the performance bottleneck of leaving the accelerator for simulations, or having to train too many small networks.
I'd generally just write my own recurrent layer, if the behavior is simple enough.
But you can also use existing code and tweak it cleverly: e.g. LSTM cells can be pre-initialized to implement continuous-time Markov chains, as a birth/death renewal process.
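The hand-written-recurrent-layer idea above can be sketched with the standard recurrent-cell contract (input + state in, output + next state out). The simulated component here is a toy leaky tank/integrator; the dynamics and names are illustrative assumptions, not from any particular system or library.

```python
# A hand-written "simulation cell" following the recurrent-cell contract.
def tank_cell(u, state, leak=0.1, dt=0.1):
    level = state
    # inflow u minus proportional leakage, integrated with explicit Euler
    next_level = level + dt * (u - leak * level)
    output = next_level
    return output, next_level

# Unroll over a sequence, as a framework's RNN wrapper would do.
def run(inputs, state=0.0):
    outputs = []
    for u in inputs:
        y, state = tank_cell(u, state)
        outputs.append(y)
    return outputs, state

ys, final_state = run([1.0] * 5)  # constant inflow: level rises toward u/leak
```

In a real framework the same cell body would live inside e.g. a custom RNN cell class, so it runs on the accelerator alongside the learned layers.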
You can capture the behavior of a simple component in isolation, then use it in the whole. Either freezing it and adding an error correction layer (e.g. if the frozen part is quite big and replicated, you can share weights more efficiently), or not freezing it and letting it train further.
You can impose bounds on the complexity of the error correction: very much like LoRA, you can design it as a low-rank matrix decomposition. Together with the right loss (e.g. L1 or Huber), it's another technique to ensure that the error correction doesn't drift too far from the behavior you can expect from the physics of the system (and when it no longer converges, that's a good indicator that you have model drift and new behavior is coming up... which is one way to implement robust anomaly detection).
---
PS: I do know about the bitter lesson... the problem with it is that it assumes you can throw more data and more training time at problems, and that they are stable or similar to what is in your data; this is not always the case.
I've worked in many research scientist/MLE roles over the years and I haven't met any IC who has this much of a fixation on AI being a moral evil/good. The ones who do are inevitably nontechnical hucksters, usually just trying to get money or self-promote.
IMO if you're going to profit off open research, you should at least make your own work available for other researchers. The white paper has 10 pages of performance benchmarks but 5 sentences on methodology.
It's a neat historical anomaly, but it is not very practical for most people to bank at: you have to be a North Dakota resident, they have one physical office in Bismarck, North Dakota, and they do not provide online banking.
> Because of our unique structure as a state-owned bank, it is the Bank’s policy not to compete with the private sector for retail deposits. Therefore, convenience products such as debit cards, credit cards or online bill pay are not offered. BND has one location at 1200 Memorial Highway in Bismarck, North Dakota.
Dplyr actually supports some really cool join functionalities that I wish were in SQL implementations, including:
- Ability to specify whether your join should be one-to-one, many-to-one, etc., so that R will throw an error instead of quietly returning 100x as many rows as expected (which I've seen a lot in SQL pipelines).
- A direct anti_join function. Much cleaner than using LEFT JOIN... WHERE b IS NULL to replicate an anti join.
- Support for rolling joins. E.g. for each user, get the price of their last transaction. Super common but can be a pain in SQL since it requires nested subqueries or CTEs.
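For readers coming from Python rather than R, pandas happens to offer close analogues of all three features; a small sketch with made-up data:

```python
import pandas as pd

# Toy data: users 1 and 2 have orders, user 3 does not
users = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
orders = pd.DataFrame({"id": [1, 1, 2], "amount": [10, 20, 30]})

# 1. validate= raises MergeError if the join is not the declared shape,
#    instead of silently multiplying rows
merged = orders.merge(users, on="id", validate="many_to_one")

# 2. anti join via indicator=: users with no orders
probe = users.merge(orders[["id"]].drop_duplicates(), on="id",
                    how="left", indicator=True)
anti = probe[probe["_merge"] == "left_only"].drop(columns="_merge")

# 3. rolling (as-of) join: latest trade price at or before each event time
trades = pd.DataFrame({"time": pd.to_datetime(["2024-01-01", "2024-01-03"]),
                       "price": [100.0, 105.0]})
events = pd.DataFrame({"time": pd.to_datetime(["2024-01-02", "2024-01-04"])})
asof = pd.merge_asof(events, trades, on="time")
```

The SQL situation is as described: the validation and as-of joins have no standard equivalent, and the anti join needs the `LEFT JOIN ... IS NULL` idiom or `NOT EXISTS`.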
I know many people who've gotten big raises over the last 2-3 years, including individuals who have more than doubled their pay. And they always attribute their raises to hard work, job hopping, or climbing the promotion ladder. At the same time, they'll also act like the 20% or whatever inflation we've had over the past few years is equivalent to the government stealing their hard-earned money.
From a macro perspective, the same factors that drove inflation are what drove their big pay hikes. But most people only see "I earned more money and now I'm being screwed by inflation". These are often the people who spend like crazy while simultaneously complaining about how bad the economy is!
Every tech company, minus the few doing core research, has been doing this for at least half a year. Generate training data with GPT-4 or sometimes even 3.5 -> use it to do a QLoRA finetune on a Llama or Mistral base -> roll it out as a "proprietary" AI model -> management claims a big win and talks about how they're leaders in "[industry name] AI".
It is remarkably easy - it takes practically zero knowledge of ML and can usually be done with less than $1k of cloud compute costs. The issue is that for most realistic tasks you can expect to end up with something roughly on the level of GPT-3.5, and it's actually really hard to compete with GPT-3.5 on a cost level, at least if you use cloud GPUs.
> Every tech company minus the few doing core research have been doing this for at least half a year
I'm assuming you mean all those new 'AI wrapper' startups popping up? I wouldn't say "every tech company". But yeah, it seems incredibly easy, definitely an easy win, and leaders get to feel ahead of the curve on AI.
I agree, in fact I would wager the opposite, that most who claim to be a tech company are simply using OpenAI or another vendor with a turnkey API for things launched recently. Over time I expect more use of fine-tuned models, but fine-tuning is not easy, especially if your goal is GPT parity (or better).
The thing I don't understand about this strategy is that it itself shows that there really is no money to be made here. I mean it's a pretty obvious giveaway that:
1. they don't have the resources to build their own technology and probably never will
2. even if they did have, the best they could do is come up with something very similar to OpenAI's GPT, i.e. a (somewhat) generic AI model. This means that OpenAI can also easily compete with them.
All these companies are doing (if anything) is that they test the market for OpenAI (or Google, MS) for free.
The flaw in your assumption is that perfect tech or tech powerhouses win. I mean, sure when they do, they win big; but the endgame for b2b SaaS is mostly M&A, powered by sales, which is mostly down to c-suite relationships and perception of being one among the market leaders ("nobody ever got fired for buying IBM").
If you can move fast, deliver, expand, and raise money, there's a good chance the AI wrapper lands a nice exit and/or morphs into a tech behemoth. Those outcomes (among others), even if mutually exclusive, are equally possible.
So, if I understand you correctly, the business strategy for an AI wrapper company would be that they acquire customers quickly from a specific niche, build a name, while having very little custom technology and then get acquired by some of the larger players who do have the actual AI tech in-house. And, for them, it would be worth it for the brand/market/existing client base.
Assuming the advances made in AI in the meantime don't eradicate the whole thing. I mean, say some company builds a personal assistant for managers to supplant secretaries, they become the go-to name, and then Google buys them in 2-3-5 years. Unless Google's AI becomes so good in the meantime that you can just instruct it in 1-2 sentences to do this for you.
> get acquired by some of the larger players who do have the actual AI tech in-house. And, for them, it would be worth it for the brand/market/existing client base.
The key is, if the incumbents truly feel they can't breach whatever moat, M&A is the safer bet over agonizing over what-ifs (I am thinking of "git wrapper" startups that saw plenty of competition from BigTech; remember Microsoft CodePlex, Google Code, AWS CodeCommit?). Given Meta's push and other prolific upstarts (OpenAI, Mistral), I don't believe access to SotA AI itself will (in the short term) be a hindrance for product-based utility AI businesses (aka wrappers).
No, as far as I have seen, the "AI wrapper" companies have jumped on GPT-4 a lot faster than other tech companies. Many bigger companies deploy GPT-4 very sparingly, if at all.
Yes, but I doubt anyone is going to get the Aaron Swartz treatment over it, especially when OpenAI's own models were no doubt built by playing fast and loose with ToS. E.g. at least as early as 2018, StackOverflow's ToS said:
"Any other downloading, copying, or storing of any public Network Content (other than Subscriber Content or content made available via the Stack Overflow API) for other than personal, noncommercial use is expressly prohibited without prior written permission from Stack Overflow or from the copyright holder identified in the copyright notice per the Creative Commons License"
Ahhhh, yes. OpenAI's good old ToS... where it's OK to break a ToS / copyright if you're OpenAI, for the input used to generate the output that you (the customer) don't own and can't cache. Because that would impact their revenue model, even though it would be more efficient (in power and cost), and would still leave them holding the bag after ingesting loads of content they never had a right to in the first place, while staking their claim that it's OK because there's a lot riding on their success.
And, oh by the way, they'll just change their ToS as it suits them for more revenue opportunities, even when they stated they wouldn't do business with, oh you know, nation-state militaries. But JK! Now we will, because <enter some 1%er excuse here>.
It took me 30 seconds to read their TOS and confirm you're just making most of that up.
> As between you and OpenAI, and to the extent permitted by applicable law, you (a) retain your ownership rights in Input and (b) own the Output. We hereby assign to you all our right, title, and interest, if any, in and to Output.
It follows that your claim about caching violating OAI's terms is nonsense.
I think you missed my point. "Caching" output by training a more efficient / cheaper model with that output is in fact against their ToS. In my simple brain that is a form of caching, and I stand by my original post.
I've not made anything up. Your claim that I have is nonsense.
OpenAI changing their ToS for the military on a whim: https://archive.is/GILKl - for your enjoyment.
OpenAI ToS:
"What You Cannot Do. You may not use our Services for any illegal, harmful, or abusive activity. For example, you may not:
* Use Output to develop models that compete with OpenAI."
> I think you missed my point. "Caching" output by training a more efficient / cheaper model with that output is in fact against their ToS. In my simple brain that is a form of caching.
If that was your point, I'm pretty sure everyone missed it. No one is training models as a form of caching their previous responses. They want to improve the quality of responses they haven't generated yet. That's not caching.
> I've not made anything up.
You said customers don't own the output; they do. I said you made most of it up, and you did. Including your apparent retconning of your original point.
> If that was your point, I'm pretty sure everyone missed it. No one is training models as a form of caching their previous responses. They want to improve the quality of responses they haven't generated yet. That's not caching.
So... you didn't read the article you're commenting on?
> You said customers don't own the output; they do. I said you made most of it up, and you did.
You don't own it. If I own something, I can do whatever I want with it. This is just like your iPhone. You don't actually own it, because you can only do with it what Apple allows you to do.
> Including your apparent retconning of your original point.
Wow, enjoy your day. Your misunderstanding is, apparently, my "retconning". Maybe read the original piece you're responding to within the thread.
Did I read the article? You mean the tweet? If you're saying it supports your claim that fine-tuning a model is equivalent to caching, you are mistaken.
> If I own something, I can do whatever I want with it
BRB digitizing my entire media collection and uploading it to the public internet.
It will be interesting if the same court cases proving their use of everyone else's data make it fair use to use their machine output as training data. They're definitely in their rights to ban whomever but who knows if they have recourse beyond that?
Have the same question. I mean, for training an open-source model with no monetization attached, there's not much OpenAI can do besides ban the user, but they can make another account. For a company doing this with the intent to sell it as a capability... seems risky.
And even this analysis is optimistic as it doesn’t factor in the $$$$ it costs to hire a data scientist to fine tune the models. Just use off the shelf models with RAG until you really need custom models
I'm curious if you can actually get better than 3.5, though, considering how meh it is at most applications. It'd be nice to know whether I could actually get a better model without the effort of trying this.
I've noticed a huge surge in negativity and pessimism on English-language social media within the last year or so, roughly corresponding to the spread of LLM tech. I do wonder whether these people are mostly just bots.
This seems to be happening in English-language meat space too, so I don't think it's bots. I'm not sure what happened. The trend started in 2023 and seems to be ongoing, though I imagine people will get over it soon. I've heard it suggested that people are just in a funk about the economy, at least in the U.S.
I've noticed this as well, across pretty much all of social media that I use. I can't quite place my finger on it, whether it's just a reflection of the state of the collective feelings of people, or if it's mass manipulation. At this point it could be both.
I don't think the first part of this article is quite right.
1. I doubt many people would think of 9/11 as a "1 in 50 years" event if it hadn't actually happened. If you had a year-by-country dataset of every developed country post-WW2, you'd have thousands of observations, but none would have as many fatalities from terrorism as the US in 2001.
2. If a genuinely super-rare event occurs (like one in thousands), it's often more reasonable to think that there's been some fundamental shift in the world that you failed to recognize, rather than that you just got super lucky or unlucky to have lived through it.
I think your second point helps make sense of my problem with the "not 1, but 2" threshold the author articulated.
> if it happens twice in a row, yeah, that’s weird, I would update some stuff
Why? Why is one event not an "update", but 2 is? Shouldn't each data point change your assumption in proportion to the assumed chance and period of observation?
That seems more true to the author's belief framework, but it wouldn't make as spicy a title.
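The proportional-update intuition can be made concrete with Bayesian odds. The numbers below are purely hypothetical: suppose an event has probability 0.02/year under your current worldview (hypothesis A) and 0.2/year under a rival worldview (B), with prior odds 9:1 favoring A.

```python
from fractions import Fraction

def posterior_odds(prior_odds, p_a, p_b, n_events):
    # Each observed event multiplies the odds by the likelihood ratio;
    # there is nothing special about the second observation.
    return prior_odds * (p_a / p_b) ** n_events

prior = Fraction(9, 1)
after_one = posterior_odds(prior, Fraction(2, 100), Fraction(2, 10), 1)
after_two = posterior_odds(prior, Fraction(2, 100), Fraction(2, 10), 2)
```

With these made-up numbers the very first event already flips the odds toward B (9:1 becomes 9:10); the second event just pushes further in the same direction. A hard "one event is noise, two is signal" rule has no place in this arithmetic.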
737 Max 9s are currently grounded due to a single incident; should the FAA have waited for a second door to blow off?
Reading that myself, it sounds like a gotcha question, but under this entirely arbitrary 2-but-not-1 threshold, the answer seems like it should obviously be yes.
The rational framework the author is advocating for is all about probabilities and percentages, so it seems like a weird exception to carve out that there’s some hard line between 1 and 2 event occurrences. I doubt he would hold fast to it if pressed, which is fine.
> Reading that myself, it sounds like a gotcha question, but under this entirely arbitrary 2-but-not-1 threshold, the answer seems like it should obviously be yes.
I think that's because the question implicitly assumes that the threshold applies for all things and all purposes. It doesn't. First the threshold is about adjusting your baselines, and second even if the threshold were for when to pull the "stop everything" cord, it all depends on your specific goals. The FAA might have completely different goals and targets than someone else in some other industry. Or to put it another way, the FAA has grounded all 737 Max 9's after a single incident with no fatalities. As of Jan 16, 11 people have been killed in homicides in Chicago. By the same "one threshold for everything", the entire city should be on complete lockdown until such time as it can be made safe by the proper authorities.
On the other hand, if you assume that one could have a very low "stop the world" response threshold for "sudden mechanical failures leading to explosive decompression" of planes, and simultaneously have a higher "stop the world" threshold for "people dying in Chicago", then it seems completely reasonable that one could have a third different threshold for the number of mass shooters that come out of any given arbitrary social clustering that trigger the "I should re-evaluate whether these people are entirely sane" routines in your brain.
Thank you for phrasing it like that. I think maybe my hang up is that the author doesn’t allow that different people and groups can have different baselines and update responses. Implicitly it seems that if everyone is perfectly rational, all responses to an event would be the same, but that’s not really true due to our subjective human experience. In the shooting example the quote came from, the Left and Right he caricatures have different priors, knowledge, experiences, and motivations than the author, someone who likes to think of themselves as rational and aloof of politics. Of course they’re going to have a different response, but that doesn’t necessarily mean they’re acting irrationally.
It’s quite different. The incident with the door heavily implies a problem of airplane design. It makes sense to ground it.
What we knew at the time of the first crash of the Max 8 seemed to imply pilot error. It wasn't statistically significant. Only when another Max 8 crashed soon after (I think soon enough to say "in a row") was the Max 8 grounded. If the second crash had occurred years later, it wouldn't have been significant.