When it comes to analyzing text, artificial intelligence (AI) and machine learning (ML) have gone mainstream. These advances have sparked a wave of excitement about how we understand what people are saying in written language. But in the race to make this tech available, our responsibility to understand the implications of its use can get lost.
We recently sat down with Vanya Cohen, Research Scientist and Machine Learning Engineer, to learn more about just that.
Q: Tell us a bit about yourself … how did you get into AI research?
I’m originally from Seattle. Before undergrad, I took a summer class at Stanford on artificial intelligence, and that’s where I had my first exposure to AI. I fell in love with it because we had a great instructor, and it was really cool at the time to be around graduate students who were interested in AI. Grad school and all that felt so distant at the time … I wasn’t really thinking about what I would do in the future.
Q: So this interest developed more once you were in school.
Yeah, I took opportunities in undergrad to nurture it whenever I could. I joined the video game club at Brown. Once I had the math background, I ended up taking a graduate-level course called “Collaborative Robotics” with Professor Stefanie Tellex, who runs the Humans To Robots lab. It focused on helping robots understand natural language.
At the time, deep learning – and neural networks for speech recognition – was in full swing, and speech recognition was finally able to actually parse what we were saying fluently. I did a project for the class which turned into me working in the lab. The project I started out with was instruction following in Minecraft. Basically, a virtual agent would follow natural language instructions to build things in the game.
Q: And that’s when you pursued a Master’s in AI.
In my graduate research, I worked on helping robots understand the shape properties of everyday objects in terms of language. For example, if you talk about a tall chair or a minimalist couch, can you get the robot to understand what you’re talking about from the words you used to describe the objects’ shapes?
Q: Wait … how does a robot understand shape descriptors?
We can give the robot several objects and describe one based on its shape. For example, I’d be in a room with a robot and several couches, and ask it to go to the sectional or go to the couch with no arms. It can tell what you most likely mean based on the natural language description and sensor data of the objects. This project and my time as a research assistant culminated in a published paper, which I presented at IROS 2019. And I then ended up doing a project in language modeling, and that’s what the Wired article is about.
Q: Oh yeah – that article in Wired! How did the OpenAI project first come on your radar?
I was in the process of wrapping up the shape and language research when OpenAI dropped a bombshell paper through a massive coordinated media rollout. The thesis of the paper was this: given some text, can a machine learning model predict what comes after? Think of auto-predict on a mobile device, where you predict what a smartphone user is going to type next, given some context letters or words. People have been experimenting with language modeling for a long while. In the last year or so, indications started emerging in the NLP community that if you train really big models on lots of text data, you can get almost human-like text generation. You can also use these models in different ways for what we call downstream tasks. These are things like sentiment analysis and translation, like we’re working on at Luminoso. OpenAI was a pioneer in showing you can use language models for downstream NLP tasks.
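To make the “predict what comes next” idea concrete, here is a minimal sketch of next-token prediction using the publicly released small GPT-2 model via the Hugging Face transformers library. The library, model size, and prompt are illustrative choices for this post, not details from the interview.

```python
# Minimal next-token prediction sketch; assumes transformers >= 4.x and PyTorch.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

context = "I was in the process of wrapping up the"
input_ids = tokenizer.encode(context, return_tensors="pt")

with torch.no_grad():
    logits = model(input_ids).logits  # shape: (batch, sequence_length, vocab_size)

# The last position scores every vocabulary token as a candidate continuation;
# show the five the model considers most likely.
next_token_logits = logits[0, -1]
top = torch.topk(next_token_logits, k=5)
for token_id, score in zip(top.indices, top.values):
    print(repr(tokenizer.decode([int(token_id)])), float(score))
```

Running this prints the handful of tokens the model considers most likely to follow the prompt. That prediction task is the entire training objective; everything else falls out of doing it well at scale.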
Q: Interesting. What were OpenAI’s conclusions?
So the paper said, “Hey, we trained a much larger language model than anyone’s thought to train!” It’s a lot better, like, scary good in terms of how far ahead it is of anything else. You could give it a headline, and it could write an entire article. It isn’t perfect, but for the most part, if you just skim what the model writes, it reads like a real article. And it isn’t just for news … you can give it a passage of any kind of text, type out a question asking what comes next, and it can write a prediction. It’s so good you can even type out a passage, ask it to translate the passage into a different language, and it does so surprisingly well, especially considering the training data was filtered, though imperfectly, to contain only English!
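As a further illustration (the specific headline and sampling settings below are assumptions, not something described in the interview), the same model can write longer passages by sampling one token at a time and feeding each choice back in. The transformers generate helper runs that loop:

```python
# Prompt-to-article generation sketch using top-k sampling.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

headline = "Scientists discover a new species of deep-sea octopus"
input_ids = tokenizer.encode(headline, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_length=200,                       # headline plus a paragraph or so
        do_sample=True,                       # sample rather than always taking the top token
        top_k=40,                             # restrict sampling to the 40 most likely tokens
        pad_token_id=tokenizer.eos_token_id,  # silence the missing-pad-token warning
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```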
Q: Wow. So what were you thinking when you saw this?
The crazy part is that you’re only predicting what’s next. You train a model to do that, and you get all this other functionality, which is pretty cool. But here’s why this became a huge media circus: OpenAI decided not to release a lot of details about how they trained their model and the weights of the trained models, because they felt it was “too dangerous”, or something to that effect. The fear was it could be used to write fake news, spam comments, or other text.
Another grad student, Aaron Gokaslan, and I thought to ourselves, “Hey, this is a little unprecedented that they’re not releasing their model or details.” We did some quick math on what it would take to replicate. Even though it was a lot, it was still within the reach of most companies, countries, and intelligence agencies. Anyone with the technical know-how could replicate the results. And if we could do it, then pretty much any agency or company definitely could.
Q: So that’s when you went to your advisor.
Yeah. We thought that given the state of the field, a lot of people who could mitigate the potential threats of these models would be left out.
We thought if we can’t recreate the model, we can at least recreate the dataset, and someone else could take it and run with it to create the model and release it to the public. Based on the details they put out there, our own suppositions about what they left out, and our own knowledge, we created the OpenWebText Corpus. It’s been used by researchers from Facebook and others.
Q: How did you go about rebuilding the model?
OpenAI had taken all links from Reddit with more than three upvotes, followed them back to the pages they linked to, and downloaded the text from all those pages. That’s where we started. Most of this raw data is unusable, so you have to de-duplicate it, strip out videos and raw HTML, and filter for English, because what we needed was English-only text. It’s also important not to grab content from sites like Wikipedia, because those are used in other NLP performance benchmarks, and you shouldn’t test on the data you train on. We ended up with about 40 gigabytes of text, the same as OpenAI.
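For a sense of what that cleaning involves, here is a rough, hypothetical sketch in Python. The helper names, the mostly-ASCII language heuristic, and the blocklist are illustrative only and are not the actual OpenWebText code; a real pipeline would use proper language identification and also catch near-duplicate documents.

```python
# Hypothetical sketch of the kind of cleaning described above: drop media links
# and benchmark domains, keep English-looking text, and de-duplicate.
import hashlib
from urllib.parse import urlparse

BENCHMARK_DOMAINS = {"en.wikipedia.org"}          # held out because it appears in NLP benchmarks
MEDIA_EXTENSIONS = (".mp4", ".jpg", ".png", ".gif", ".pdf")

def keep_url(url: str) -> bool:
    """Skip links to media files and to domains used in evaluation benchmarks."""
    parsed = urlparse(url)
    if parsed.netloc.lower() in BENCHMARK_DOMAINS:
        return False
    return not parsed.path.lower().endswith(MEDIA_EXTENSIONS)

def looks_english(text: str) -> bool:
    """Crude stand-in for a language-ID model: mostly-ASCII heuristic."""
    if not text:
        return False
    ascii_chars = sum(1 for ch in text if ord(ch) < 128)
    return ascii_chars / len(text) > 0.95

def deduplicate(documents):
    """Exact de-duplication by content hash; a real pipeline would also
    catch near-duplicates."""
    seen = set()
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc

def build_corpus(url_to_text):
    """url_to_text: mapping from a scraped URL to the text extracted from that page."""
    kept = (text for url, text in url_to_text.items()
            if keep_url(url) and looks_english(text))
    return list(deduplicate(kept))
```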
It would have taken years to train this model on our lab computers. We needed a “supercomputer”. So we went to Google, and asked nicely if we could get some Google Cloud Credits. They really want researchers to use their platform, so they agreed to it, and a few weeks later, we got the computing power on their cloud.
At this point, we reached out to OpenAI to clarify a few details they had left out of their paper, and they didn’t reply. So we continued training, and released the model in August. And right after that was when the Wired article was published.
Q: It looks like it took less than 6 months for you to rebuild the OpenAI project.
Yes. And OpenAI did respond to our emails immediately after we released. It was a little awkward at first, but we ended up having a great dialogue with them about their research. We’re also at the point where we’re amicable. They just had a different opinion about release norms in this instance. But they’ve been really nice and willing to engage with us and other researchers.
Q: What did you learn from this experience?
First and foremost, right now these text generation models are most useful to researchers, and probably second to corporations and organizations developing fun games like AI Dungeon 2. To OpenAI’s credit, they’ve done a bunch of studies into whether these models are being or could be abused. They’ve basically gotten a null result. Even after these models were released to the public, they haven’t been found to be used for harm in any widespread way. Fake news isn’t worse than it was, and there hasn’t been an uptick in spam comments.
Aaron’s research was in computer vision, like images and videos, so he was much more attuned to “deep fakes”. Fake text has been a problem on the internet since the beginning of the internet: spam comments, spam emails, what have you. But fake videos of politicians saying inflammatory things – that’s unprecedented. So that was our thinking about what should guide policies around this research. The already really bad thing is out there: deep fakes of audio and video are already here, and the models to generate them are widely available. Given that the abuse of GPT-2 has been fairly minimal so far, putting it out there seems like the better way to go: in the long term it gives the research community a head start on mitigating some future version or use that’s actually bad. It also lets people see the positive and fun uses of large language models.
Having these models out in the open gives the research community the chance to develop and train detectors against fake text, so they can work on these problems in an open and collaborative way. This is how computer security research works; the norm was established years ago. If you find something potentially dangerous – some exploit – you don’t hide it, because someone will find it and abuse it. You tell people quickly and provide every tool you can to address it.
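As a toy illustration of what “training against fake text” can look like, the sketch below fits a simple bag-of-words classifier with scikit-learn to separate human-written from model-generated passages. The placeholder examples and model choice are assumptions for illustration; real detectors are typically much stronger neural classifiers trained on large labeled corpora.

```python
# Toy generated-text detector: TF-IDF features + logistic regression.
# The example passages are placeholders; a real detector would be trained on
# large corpora of human-written and model-generated text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

human_texts = [
    "The city council met on Tuesday to debate the new transit budget.",
    "Researchers at the university published a study on coral bleaching.",
]
generated_texts = [
    "The moon is a place where the moon is a place where the moon is.",
    "In a statement, the statement stated that the statement was stated.",
]

texts = human_texts + generated_texts
labels = [0] * len(human_texts) + [1] * len(generated_texts)  # 1 = model-generated

detector = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # unigram + bigram features
    LogisticRegression(max_iter=1000),
)
detector.fit(texts, labels)

print(detector.predict(["The committee voted to approve the measure on Friday."]))
```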
Q: That’s great. And now you find yourself here, at Luminoso. What’s that like as a former academic?
There are a lot of really cool things happening in NLP research in academia. But getting that research to work on real customer data is a whole different problem. You can build, train, and test these models in the lab and think they work really well. But once you work outside of the lab, you’re seeing completely different data, and oftentimes what you’ve done won’t work. You also have to consider that you’re tackling business problems and questions you wouldn’t get in an academic setting.
Q: So true. We’re really excited to have you on the team! What’s next for you in your work here?
Computer vision really grew up as a field in the last few years, to the point where you can take things that used to exist only in the lab and put them in the real world. Snapchat face filters are an example of people using really advanced computer vision algorithms every day and taking them for granted.
I think NLP is undergoing a similar transition now. Techniques developed in academia are starting to work really well for certain kinds of NLP problems, like sentiment analysis and translation. I’m really motivated by finding the answers to “Why do these things that work in the lab break in the real world?” You can’t have things only work in the lab, and never be applicable in real world settings. That’s not fun for me.
Relevant to that, these large language models that I’m interested in … that’s what’s driving NLP research right now. Luminoso uses word vectors and common-sense knowledge in its models, and we’ve figured out how to put this in products. These large language models are the new thing. I’m excited to use these in a real-world setting.
Q: It was great chatting with you today. We’re excited to see what’s next for NLP.
Thanks! It was a pleasure. No one knows what’s next, and that’s the cool thing.