Enter The Matrix (Part 1)

Update: Part 2 of this series is now available here!

For a long time now, I’ve been a big fan of The Matrix Trilogy by the Wachowskis.  The Matrix especially is a movie that sits near the top of my all-time movie lists.  I figured it was about time I dug into the movies in a bit more detail, so I decided I’d finally bite the bullet and create a Qlik app.  Little did I know what I was getting myself into.  You can see a sample of the final app below.  I’ll be sharing it in a future post, I promise:

What I expected to be a relatively straightforward project turned out to be a meandering adventure into the internet, Qlik Web Connectors, REST APIs and extensions, plus some interesting residual challenges to boot.

In this post (#1 of several), we’ll talk about how we obtained and extracted the 3 Matrix Trilogy scripts, and then how we laid the groundwork to perform sentiment analysis on every line of dialogue using the Aylien Qlik Web Connector.

Let’s Enter the Qlik Matrix…

1)  Sourcing and Extracting Movie Scripts

A key driver for this project was to be able to ingest the scripts from all 3 movies (The Matrix, The Matrix Reloaded and The Matrix Revolutions) and analyse them.  To make my job as easy as possible, I went hunting for scripts for all 3 movies in the same format, so that I only needed to build the processing logic once.  I eventually found them at matrixfans.net (Matrix, Reloaded and Revolutions).

I manually copied and pasted each script into Text files using UltraEdit (Other text editors are available…), and started to process the files using QlikView.

For this project, I decided to focus on analysing the speech in the movies, not scene changes or stage directions.  Maybe one for the future.  I filtered those out by removing all lines beginning with ‘(’ or ‘{’, then split each remaining line into 2 fields – OrigCharacter and Dialogue.  I also counted occurrences of ‘?’ to see who asked the most questions.
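For anyone who wants to try the same filtering outside QlikView, here is a minimal Python sketch of the idea. It assumes each spoken line has the form `NAME: dialogue`, which may not match the layout of the matrixfans.net scripts exactly. The `parse_script` function, the `ScriptLine` type and the sample lines are all made up for illustration.

```python
from typing import List, NamedTuple


class ScriptLine(NamedTuple):
    row: int             # running row number, usable later as a unique key
    orig_character: str  # speaker name exactly as it appears in the script
    dialogue: str        # the spoken text
    questions: int       # number of '?' characters in the dialogue


def parse_script(text: str) -> List[ScriptLine]:
    """Keep only spoken lines and split them into character and dialogue.

    Assumption: spoken lines look like 'NAME: dialogue'. Lines starting
    with '(' or '{' are treated as scene or stage directions and dropped.
    """
    rows = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "({":
            continue  # blank line or direction, not speech
        name, sep, speech = line.partition(":")
        if not sep:
            continue  # no 'NAME:' prefix, so not a recognisable spoken line
        speech = speech.strip()
        rows.append(
            ScriptLine(len(rows) + 1, name.strip(), speech, speech.count("?"))
        )
    return rows


# Illustrative input only, not copied from the real script files.
sample = """(Interior, night)
Trinity: Is everything in place?
{Phone rings}
Cypher: You weren't supposed to relieve me.
"""
for row in parse_script(sample):
    print(row)
```

Splitting on the first colon only means dialogue that itself contains a colon stays intact in the Dialogue field.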

Analysing the results of OrigCharacter in QlikView, it was clear that stray newline (\n) characters were causing problems, especially in the middle of long strands of dialogue.  So, I did a bit of manual editing of the script files to remove these, using OrigCharacter as a strong hint of where to look (a broken line usually showed up as a fragment of dialogue sitting in the OrigCharacter field).
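The "strong hint" idea can be roughly automated. The Python sketch below flags OrigCharacter values that look more like dialogue than like a speaker name. The rule (more than 3 words, or any sentence punctuation) is an arbitrary assumption for illustration, and the candidate values are invented.

```python
import re


def looks_like_dialogue(orig_character: str, max_words: int = 3) -> bool:
    """Heuristic: real speaker labels are short and lack sentence punctuation.

    The 3-word threshold is an assumption for this sketch, not a tested rule.
    """
    too_long = len(orig_character.split()) > max_words
    has_sentence_punctuation = re.search(r"[.!?,]", orig_character) is not None
    return too_long or has_sentence_punctuation


# Invented examples of what the OrigCharacter field might contain.
candidates = ["Agent Smith", "Smith 2", "and then he said, wait", "Neo"]
suspects = [name for name in candidates if looks_like_dialogue(name)]
print(suspects)  # ['and then he said, wait']
```

Anything flagged this way still needs a human look in the source file; the check only narrows down where to search.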

After that, I exported the cleaned list of Characters out of QlikView, and spent around an hour manually mapping them to their true character names.  Example – Agent Smith often appears as Smith 1, Smith 2 etc. on the occasions there are multiple Agent Smiths replicating themselves.  But I’m counting that all as one Smith 😊
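A mapping like this can be sketched in a few lines of Python. The `MANUAL_MAP` entries and the rule for stripping a trailing clone number are assumptions for illustration, not the curated list used in the app.

```python
import re

# Hand-curated overrides; these entries are examples, not the real list.
MANUAL_MAP = {
    "Smith": "Agent Smith",
    "Agent Smith": "Agent Smith",
}


def map_character(orig_character: str) -> str:
    """Collapse numbered clones ('Smith 1', 'Smith 2') onto one master name."""
    # Drop a trailing clone number such as ' 1' or ' 12'.
    base = re.sub(r"\s+\d+$", "", orig_character.strip())
    # Fall back to the cleaned name itself when no override exists.
    return MANUAL_MAP.get(base, base)


for name in ["Smith 1", "Smith 2", "Agent Smith", "Neo"]:
    print(name, "->", map_character(name))
```

Inside a Qlik load script, a curated list like this is commonly applied with a mapping table and ApplyMap, although the post does not say which mechanism was actually used.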

I then saved and fed back the curated list to give myself a list of master Mapped Characters across the trilogy.  And even at this stage, there’s a possibility to do some rudimentary dialogue analysis based on Character and Movie. The code ended up looking a bit like this:

But we don’t stop there, of course!

2) Script Sentiment Analysis using Aylien Qlik Web Connector

So, one of the most interesting things I wanted to do was provide sentiment analysis on the dialogue.  By sentiment, I mean classifying each dialogue strand (which could be multiple sentences) as Negative, Neutral or Positive, as well as understanding whether the quote was Objective or Subjective.

I remembered from a previous project that Qlik Web Connectors had some sentiment options, but I was keen to use something that was free to test this out, and so that you guys could too if you needed to.

Download Qlik Web Connectors

To download the installation packages for the connectors, go to Qlik Market and log in with your Qlik account.  Download and unzip Qlik Web Connectors.  Then start the executable QlikWebConnectors.exe:

You should then see Qlik Web Connectors running and be able to click the link to get to the web interface:

Finally, you’ll be looking for the Qlik AYLIEN Connector in the BETA tab.

 

Create an Aylien Developer Account

Create an account at Aylien.com to get an App ID and API Key.

There are some important things to note if you want to try the Aylien connector:

  • The Aylien free account only allows 1000 API calls per day (this includes calls from the web interface as well as Qlik extractions), so I needed to spread the extraction of my ~2500 dialogue passages over 3 days (or 1 day and 3 trial accounts – at your own risk!)
  • The Aylien API call needs the Sentiment Endpoint, and I used the Twitter Sentiment Mode. I kinda had to choose that one; a future improvement could be to also use the Document sentiment mode and compare the results for accuracy.
  • Aylien is a BETA Qlik Web Connector and will expire on January 31, 2018, so if you want to try this yourself for free, you need to do it by then.
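The 3-day figure follows directly from the daily limit. A small Python sketch of the arithmetic, using the round numbers quoted above (the real passage count is only approximately 2500):

```python
import math

DAILY_LIMIT = 1000   # free-tier API calls per day, per the note above
TOTAL_ROWS = 2500    # approximate number of dialogue passages

days_needed = math.ceil(TOTAL_ROWS / DAILY_LIMIT)

# Inclusive (first_row, last_row) ranges, one per day.
batches = [
    (start, min(start + DAILY_LIMIT - 1, TOTAL_ROWS))
    for start in range(1, TOTAL_ROWS + 1, DAILY_LIMIT)
]

print(days_needed)  # 3
print(batches)      # [(1, 1000), (1001, 2000), (2001, 2500)]
```

Because test calls made through the web interface count towards the same limit, a full 1000-row batch only fits on a day when nothing else has been sent.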

Configure the Aylien Qlik Web Connector

It’s important to test the Aylien web connector using the Web Connector interface.  By doing this, you can then generate the code which you can pop straight into QlikView.

Modifying the Web Connector Script and Adding Encoding

A bit of tweaking is then needed to make the generated code run in a loop and process the ~2500 dialogue passages in the Matrix Trilogy.

As you can see, we’ve wrapped the code in a loop, and used the Row number generated when we were loading in the script files to pick out each unique line of dialogue.  The dialogue is then encoded into the textEncoded variable, and finally, the call is made to the web connector.

Some other key points here:

  • Qlik Web connectors MUST be running for this script to work
  • Note the sentimentText=$(textEncoded) part of the FROM statement.  This ensures that the correct Dialogue for each row is inserted for analysis
  • I had to run in batches of 1000 on a daily basis to stay under the Aylien usage limits on a free account.  Remember to amend the loop code accordingly, and don’t overwrite your QVDs!
  • If your script fails, re-run it manually through the web interface to make sure you’ve not missed something.
  • Don’t forget to encode your dialogue!!!  See below…
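To make the shape of that loop concrete outside Qlik, here is a hedged Python sketch that builds one request URL per row and gives each batch its own output file name. `BASE_URL` is a hypothetical placeholder, not the real connector address, and the real generated URL carries more parameters than `sentimentText`. The helper names, the file naming scheme and the sample dialogue are also assumptions. Nothing is actually sent over the network.

```python
from urllib.parse import quote

# Hypothetical placeholder, NOT the real connector address: copy the actual
# URL from the script that the Web Connectors interface generates for you.
BASE_URL = "http://localhost:5555/placeholder"


def build_request_url(dialogue: str) -> str:
    """Return the request URL for one dialogue passage, with the text encoded."""
    text_encoded = quote(dialogue, safe="")  # percent-encode all reserved characters
    return f"{BASE_URL}?sentimentText={text_encoded}"


def batch_output_name(first_row: int, last_row: int) -> str:
    """One output file per batch, so a later run never overwrites an earlier one."""
    return f"sentiment_{first_row:04d}_{last_row:04d}.qvd"


# Illustrative rows keyed by the Row number from the script load.
dialogue_by_row = {1: "Wake up, Neo...", 2: "What is the Matrix?"}
first_row, last_row = 1, 2
for row in range(first_row, last_row + 1):
    print(row, build_request_url(dialogue_by_row[row]))
print(batch_output_name(first_row, last_row))  # sentiment_0001_0002.qvd
```

Putting the row range in the file name is one simple way to follow the "don’t overwrite your QVDs" advice when the loop bounds change from day to day.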

At the end of this, you’ll end up with some incredibly powerful sentiment analytics on every line of dialogue, including:

  • Sentiment Polarity (Positive, Neutral, Negative)
  • Sentiment Confidence (0 to 1, 1 being most confident)
  • Subjectivity (Objective, Subjective)
  • Subjectivity Confidence (0 to 1, 1 being most confident)

There’s some really interesting potential here to see whether Agent Smith really is as evil as he makes himself out to be, and how confident he is in the many monologues he delivers throughout the trilogy.

Don’t Forget The Encoding!!

URL encoding is vital for Aylien in the Edit Script of the app.  The default script Qlik suggests on the Help pages is a good start but not enough: you need to encode some more characters, at least ! and ' (apostrophe).  I have used the following URL encoding Sub:

Sub urlEncode(str)

// Percent-encode characters that would otherwise break the connector URL.
// Straight quotes are required here; curly quotes will not parse.
let str=replace(str, '%', '%25'); // should be first, so later output is not re-encoded
let str=replace(str, '#', '%23');
let str=replace(str, ' ', '%20');
let str=replace(str, '!', '%21');
let str=replace(str, '$', '%24');
let str=replace(str, '&', '%26');
let str=replace(str, '+', '%2B');
let str=replace(str, ',', '%2C');
let str=replace(str, '.', '%2E');
let str=replace(str, '/', '%2F');
let str=replace(str, '\', '%5C');
let str=replace(str, ':', '%3A');
let str=replace(str, ';', '%3B');
let str=replace(str, '=', '%3D');
let str=replace(str, '?', '%3F');
let str=replace(str, '@', '%40');
let str=replace(str, '[', '%5B');
let str=replace(str, ']', '%5D');
let str=replace(str, '>', '%3E');
let str=replace(str, '<', '%3C');
let str=replace(str, chr(10), '%0A'); // Line feed.
let str=replace(str, chr(39), '%27'); // 39 Apostrophe, via chr() to avoid quoting a quote

call=str;

End sub
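One way to sanity-check an encoding table like this outside Qlik is to port it to Python and compare it with the standard library's `urllib.parse.quote`. The sketch below is such a port, not part of the Qlik script. The `url_encode` name and the test sentence are made up for illustration.

```python
from urllib.parse import quote, unquote

# Same substitutions as the Sub, in the same order. '%' must come first,
# otherwise the '%' characters produced by later replacements get re-encoded.
REPLACEMENTS = [
    ("%", "%25"), ("#", "%23"), (" ", "%20"), ("!", "%21"), ("$", "%24"),
    ("&", "%26"), ("+", "%2B"), (",", "%2C"), (".", "%2E"), ("/", "%2F"),
    ("\\", "%5C"), (":", "%3A"), (";", "%3B"), ("=", "%3D"), ("?", "%3F"),
    ("@", "%40"), ("[", "%5B"), ("]", "%5D"), (">", "%3E"), ("<", "%3C"),
    ("\n", "%0A"), ("'", "%27"),
]


def url_encode(text: str) -> str:
    """Apply the replacement table in order, mirroring the Qlik Sub."""
    for raw, encoded in REPLACEMENTS:
        text = text.replace(raw, encoded)
    return text


line = "Whoa! It's 100% real?"
print(url_encode(line))       # Whoa%21%20It%27s%20100%25%20real%3F
print(quote(line, safe=""))   # Whoa%21%20It%27s%20100%25%20real%3F

# Round trip: decoding the table's output gives back the original text.
print(unquote(url_encode("Mr. Anderson...")))  # Mr. Anderson...
```

The two encoders agree on this sentence but not on every input. The table encodes `.` as `%2E` while `quote` leaves it alone (both decode to the same text), and characters missing from the table, such as double quotes or non-ASCII typographic quotes, pass through the table untouched.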

Next Time…

In the next post, we’ll look at how we get Matrix Trilogy movie information using the REST API connector, as well as sourcing and installing the iconic fonts, and picking out some interesting extensions in QlikView useful for this type of analysis.

Until next time (hopefully very soon!), continue to makeitqlik!

Best,

Brian
