AllenNLP – An open-source NLP research library, built on PyTorch

TekMol · on Sept 16, 2017

Wow, is this really state of the art?

    Joe did not buy a car today.
    He was in buying mood.
    But all cars were too expensive.

    Why didn't Joe buy a car?

    Answer: buying mood

I think I have seen similar systems for decades now. I thought we would be further along by now.

I have tried for 10 or 20 minutes now. But I can't find any evidence that it has much sense of syntax:

    Paul gives a coin to Joe.

    Who received a coin?

    Answer: Paul

All it seems to do is to extract candidates for "who", "what", "where" etc. So it seems to figure out correctly that "Paul" is a potential answer for "Who".

No matter how I rephrase the "Who" question, I always get "Paul" as the answer. "Who? Paul!", "Who is a martian? Paul!", "Who won the summer olympics? Paul", "Who got a coin from the other guy? Paul!"

Same for "what" questions:

    Gold can not be carried in a bag. Silver can.

    What can be carried in a bag?

    Answer: Gold

galenko · on Sept 16, 2017

Sadly, the NLP world is full of hot air. I've seen so many companies get funding for complete "written by a 12-year old" dogshit "industry leading IP", it's not even funny anymore.

The hype has gone down and some are actually doing great work, but 90% of the people who say they do NLP/AI stuff don't even fundamentally understand what NLP/AI is.

fnl · on Sept 16, 2017

Sadly, I'd fully agree with this. Things are possible now that were not 10 years ago. But mostly, only performance increased on things we could do 10 years ago, while hardly any new abilities came along. Machine translation, linguistic parsing, etc. came a long way. But we still can't do satisfactory abstractive summarization or create a conversational agent for more than an extremely narrow domain. Yet, at least the things we can do can be done at levels that are "production ready".

tanilama · on Sept 16, 2017

I still hold hope. However, it seems that naively exploiting the function approximation capacity we have with deep learning can only go so far toward understanding our own language.

Maybe we need to look back, start from the beginning, and ask ourselves: How do humans learn, exactly? How do we learn from so few examples? How do we jointly learn image/audio/video/language with only one brain?

amelius · on Sept 16, 2017

Perhaps we should consider working only on techniques which improve as more computational power is thrown at them.

glup · on Sept 16, 2017

Computation won't help if we don't have the right representations. Arguably computation can help us discover the right representations but the space of possible representations is very, very large.

glup · on Sept 16, 2017

All of the above require fairly complex world knowledge as well as an explicit representation of a scene. There is minimal leverage for lexical distributional statistics in these cases—arguably the one thing we have had major success in using (e.g. building vector space word representations, like Word2Vec; finding the highest probability parse tree for an utterance).

senatorobama · on Sept 16, 2017

A key tenet of supervised learning is that you will only ever do as well as what's in your training set.

TekMol · on Sept 16, 2017

They state that it "achieved state-of-the-art accuracies on the SQuAD dataset (Wikipedia sentences) in early 2017"

So I would assume it has been heavily trained?

yorwba · on Sept 16, 2017

If you look at example questions from the dev set[1], you'll realize that they all use the same words as the sentence containing the answer. Additionally, the topics aren't everyday stuff, but something you'd write a Wikipedia article about. So I guess the model just learns to find the sentence most similar to the question and then selects an answer based on a coarse categorization, which fails when it is presented with unseen situations.

Your example works if you rephrase the question to be more similar to the text:

  Paul gives a coin to Joe.
  Whom does Paul give a coin to?
  Answer: Joe

You can cut the question down to "gives to?" or "coin to?", because that's enough to single out the answer. But as soon as you use s̶y̶n̶o̶n̶y̶m̶s̶ (EDIT: related words) that are not recognized (like "receive"), you have no chance of getting a meaningful answer.

[1] https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/
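To make that guess concrete, here is a toy sketch of "pick the most similar sentence, then pick an answer by coarse question type". It is not BiDAF and not anything from AllenNLP; every function name is made up for illustration, and the "who means a capitalized word" rule is an assumption standing in for whatever categorization the real model learns.

```python
import re

def tokens(text):
    # Lowercased word tokens; punctuation is dropped.
    return re.findall(r"[a-z']+", text.lower())

def split_sentences(passage):
    # Naive split on sentence-final punctuation followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", passage) if s.strip()]

def most_similar_sentence(passage, question):
    # Score each sentence by how many distinct question words it shares.
    # On a tie, max() keeps the earliest sentence.
    q_words = set(tokens(question))
    return max(split_sentences(passage),
               key=lambda s: len(q_words & set(tokens(s))))

def answer(passage, question):
    sentence = most_similar_sentence(passage, question)
    q_tokens = tokens(question)
    q_words = set(q_tokens)
    words = re.findall(r"[A-Za-z']+", sentence)
    if q_tokens and q_tokens[0] in ("who", "whom"):
        # Assumed rule: a "who" answer is a capitalized word
        # that the question did not already mention.
        candidates = [w for w in words
                      if w[0].isupper() and w.lower() not in q_words]
    else:
        # Assumed rule: otherwise take any word the question did not mention.
        candidates = [w for w in words if w.lower() not in q_words]
    return candidates[0] if candidates else None

print(answer("Paul gives a coin to Joe.", "Who received a coin?"))
print(answer("Paul gives a coin to Joe.", "Whom does Paul give a coin to?"))
print(answer("Gold can not be carried in a bag. Silver can.",
             "What can be carried in a bag?"))
```

With no syntax at all, this gives "Paul", "Joe" and "Gold" for the three inputs, the same answers reported in this thread. That proves nothing about what the real model does internally; it only shows that the observed behaviour is consistent with plain word overlap.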

adrianbg · on Sept 16, 2017

The "Who did What" dataset seems much better in this respect:

Passage: Britain’s decision on Thursday to drop extradition proceedings against Gen. Augusto Pinochet and allow him to return to Chile is understandably frustrating ... Jack Straw, the home secretary, said the 84-year-old former dictator’s ability to understand the charges against him and to direct his defense had been seriously impaired by a series of strokes. ... Chile’s president-elect, Ricardo Lagos, has wisely pledged to let justice run its course. But the outgoing government of President Eduardo Frei is pushing a constitutional reform that would allow Pinochet to step down from the Senate and retain parliamentary immunity from prosecution. ...

Question: Sources close to the presidential palace said that Fujimori declined at the last moment to leave the country and instead he will send a high level delegation to the ceremony, at which Chilean President Eduardo Frei will pass the mandate to XXX.

Choices: (1) Augusto Pinochet (2) Jack Straw (3) Ricardo Lagos

https://arxiv.org/abs/1608.05457

TekMol · on Sept 16, 2017

Yes, that might be a more accurate description. It picks the "who" thingy from the most similar context.

With zero further understanding as it seems:

    Paul gives no coin to Marray. Paul gives a coin to Joe.
    Who got something from Paul?
    Answer: Marray

    Paul gives no coin to Marray. Paul gives a coin to Joe.
    Who received a coin?
    Answer: Paul

rspeer · on Sept 16, 2017

Heavily trained on SQuAD questions. There are lots of models out there that are very good at recognizing SQuAD questions, and reverse-engineering the predictable ways that the Turkers who wrote the questions pulled the information out of the paragraph -- allowing them to answer the question without ever understanding it. https://arxiv.org/abs/1707.07328

halflings · on Sept 16, 2017

The difference with new NN-based systems is that they are trained end-to-end, learn the syntax and some form of "reasoning". Check Memory Networks, by facebook, for example (two NNs, one for "reasoning" and one for storing long-term data, quite impressive).

Now, it's still an area of active research... and I'm not sure what "state-of-the-art" means for this library; somebody said that it ranks 27th on some commonly used dataset.

reachtarunhere · on Sept 16, 2017

I am working with Memory Networks as part of my thesis. If you actually read and implement the FB paper you realise that the system is not half as great as the demo shows. It is as bad as the top comment here suggests. Yes, we have come a long way, but frankly the hype is too high.

msamwald · on Sept 16, 2017

According to the website they use the BiDAF model, which as a single model does not produce state-of-the-art results on the SQuAD benchmark. It is ranked 27th here: https://rajpurkar.github.io/SQuAD-explorer/

rubyfan · on Sept 16, 2017

I can’t imagine that the human mind works even remotely how these NLP systems work. Grammars, tokenizing, matrices... there must be a better approach

make3 · on Sept 16, 2017

Try the more complicated text sources. It is able to parse them and still answer questions reasonably well

mamp · on Sept 16, 2017

This is very brittle: it works really well on the pre-canned examples but the vocabulary seems very tightly linked. It doesn't handle something as simple as:

'the patient had no pain but did have nausea'

It doesn't yield anything helpful on semantic role labeling and didn't even parse on machine comprehension. If I vary it to ask, say, 'did the patient have pain?' the answer is 'nausea'.

CoreNLP provides much more useful analysis of the phrase structure and dependencies.

sanxiyn · on Sept 16, 2017

In "Adversarial Examples for Evaluating Reading Comprehension Systems" https://arxiv.org/abs/1707.07328, it was found that adding a single distracting sentence can lower the F1 score of BiDAF (which is used in the demo here) from 75.5% to 34.3% on SQuAD. In comparison, human performance goes from 92.6% to 89.2%.
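For context on what those F1 numbers measure: as far as I understand it, SQuAD F1 is a token-overlap score between the predicted span and a reference answer, averaged over questions. Below is a simplified sketch of the per-answer score. The function names are my own, and this is not the official evaluation script, which also takes the best score over several reference answers per question.

```python
import re
import string
from collections import Counter

def normalize(text):
    # Lowercase, strip punctuation, drop English articles, split on whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def token_f1(prediction, reference):
    pred, gold = normalize(prediction), normalize(reference)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Joe", "Joe"))          # exact match
print(token_f1("Paul", "Joe"))         # no shared tokens
print(token_f1("Paul gives", "Paul"))  # partial credit
```

On that scale, the drop quoted above is about 41 points for BiDAF (75.5 to 34.3) against about 3.4 points for humans (92.6 to 89.2).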

andrew3726 · on Sept 16, 2017

There's a blog post (the morning paper) about this: https://blog.acolyer.org/2017/09/13/adversarial-examples-for...

vbuwivbiu · on Sept 16, 2017

"the squid was walked by the woman"

"what is the fifth word in that sentence ?"

Answer: squid

strin · on Sept 16, 2017

We need more demos of AI models: there is what people claim their model does, and there is what the model actually does.

wyldfire · on Sept 16, 2017

How does this compare with spacy?

glup · on Sept 16, 2017

Different set of tasks. SpaCy is focused on bread-and-butter tasks like tokenization, part of speech tagging, and dependency parsing (not to say that these are easy, but that they are things people have been working on a long time). AllenNLP seems focused on distributing relatively recent neural models (last few years) of more complex language understanding like labeling semantic roles (agents, patients, etc.) and identifying textual entailments (=mining facts from a sentence). It is not great at these tasks, because this is very difficult and a very active area of ongoing research.