Wednesday, August 14, 2013

Everybody likes to predict, but nobody likes being predictable, nor told what to do

The Netflix algorithm is in the news again.
The Science Behind the Netflix Algorithms That Decide What You’ll Watch Next

Netflix finds that rating predictions are no longer as important; they are trumped by current viewing behaviour, i.e. what you are watching now. However, browse through the comments and, once again, you will see a generally negative reaction. Some people really hate being told what to watch, even if it's just a recommendation. Others say Netflix sucks because it recommends things they've watched elsewhere. That sounds like a lack of understanding: if you don't tell Netflix you've watched something already, how could it know?

As "big data" gets more media attention, it is reaching a wider audience who don't yet understand how algorithms work, but only know there are algorithms everywhere in their life, and it's scary to them. The lack of understanding seems to create fear and resentment.

LinkedIn and Facebook's recommendation systems for helping people find colleagues or friends they may know are generally well received, yet these film recommendation systems aren't. The difference between them might underline the success criteria of rolling out such recommendation systems.

Tuesday, August 13, 2013

Machine Learning in Movie Script Analysis Rouses Angry Reactions

An application of Machine Learning has been in the news lately: movie script analysis.
Solving Equation of a Hit Film Script, With Data

They "compare the story structure and genre of a draft script with those of released movies, looking for clues to box-office success". However, the comments reveal that the general population (or at least the commenters) dislikes the concept for fear that it stifles creativity.

Comments like these sum up the overall sentiment:
"Using old data to presage a current idea is both terrible and foolish. It is to writing what Denny's is to fine dining - mediocrity run wild."   
"Data crunchers will take the art out of everything. Paint-by-numbers."  

Ouch.
You be the judge whether this is a good application or not.

I tend to lean towards answers like this one from the comments (sadly this was only 1 of 2 positive comments at the time of my reading; the other one was from the CEO of the script analysis business):
"I'm sure people have all sots of assumptions about what audiences like already. This data could be a tool to look deeper into these assumptions. Film makers have always wondered about consumer taste. It is a business. When commerce and art mix, there are inevitable compromises. This tool helps people see possible preferences based on past behavior. Information should never frighten us. It is how this information is applied that most deserves our attention." 

I think it also never helps the image of machine learning practitioners when the journalist paints one of them as an antagonist, with details such as "chain-smoking" and "taking a chug of Diet Dr Pepper followed by a gulp of Diet Coke and a drag on a Camel". It reminded me somewhat of another writer's style when covering analytics.

Monday, August 12, 2013

Our labels: data scientist vs statisticians (or OR)

A perennial discussion of identities in the world of analytics is making the rounds on the blogs of statisticians. Or wait a second, what should we call them?
Data scientist is just a sexed up word for statistician

Data Scientists, Statisticians, Applied Mathematicians, Operational Researchers... just to name a few, are the labels one might apply to oneself in the field of analytics. How shall we label ourselves? I can't agree more with Nate Silver,
"Just do good work and call yourself whatever you want."

Value chain trumps good design - ColaLife

Babies in Africa suffer and die from diarrhoea, but it's easily treatable with medicines that cost pennies. The problem is getting the medicine into the mothers' hands - a supply chain problem in a rural and sparsely populated area.

Here comes ColaLife: Turning profits into healthy babies.

Inventing medicine packaging that fits into the gaps between Coca-Cola bottles is ingenious, but understanding the value chain, so that all hands that touch the supply chain of the medicine have an incentive to ensure its stock and flow, is even more important.

If there is only one message to take away, I would choose:
"What's in it for me?" 
Always ask this to make sure there is a hard incentive for all players to participate. Free give-aways are often not valued, resulting in poorly managed resources and a relatively low success rate. Ample training and advertising for awareness and effective usage are also key for product / technology adoption.

Saturday, August 3, 2013

The Slightly Rosier Side of Gambling Analytics

Having posted about the ugly side of analytics - casino loyalty programmes - I was drawn to the Guardian's DataBlog article on a rosier side of gambling analytics, where a UK technology firm uses machine learning to combat gambling addiction.

Of course, a business is still a business. It needs to be profitable, so there are more reasons than just "let's be good". I list below my take on the reasons for "them", the gambler clients, and the reasons for "us", the casinos. Note that I simply assumed the machine learning study was sponsored by the casinos.

Just for "them":

Casinos too have a corporate social responsibility (CSR). Helping pathological gamblers, or identifying them before they become one, is a nice thing to do.

For "them" and for "us":

More for everyone! They get to play more, and we get to profit more. Having more people play a little for longer is better than having them play a lot for a short time and then end up on self-exclusion lists. (I'm not sure which is the lesser evil of the two though...)
That's the business case. It's not all soft and cuddly like the CSR. Well, ok, business cases almost never are.
"If you can help that player have long term sustainable activity, then over the long term that customer will be of more value to you than if they make a short term loss, decide they are out of control and withdraw completely"

Just for "us":

Minimising gambling problems helps keep the country's regulators off the companies' backs, so they don't have to relocate when the country's regulations tighten. Relocation = cost. A lot of it.
Plus,
"And there's also brand reputation for the operator. No company wants to be named in a case study of extreme gambling addiction, to be named in relation to a problem gambler losing their house"


A side note: This reaffirmed why I don't gamble...it's a lose-win situation.

"A lot of casino games operate around a return-to-player rate (RTP) whereby if the customer pays, say £100, the game would be set up to pay back an average of £90. Different games will have different RTPs, and there are a few schools of thought on whether certain rates have different impacts on somebody's likelihood of becoming addicted. Some believe that if you lose really quickly, you'll be out of funds very quickly and will leave, and that a higher RTP will keep people on site, but others disagree"
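The RTP arithmetic in the quote can be sketched in a few lines of Python. This is only an illustration: the function names are made up, and the idea of re-staking the whole balance every round is a simplifying assumption of mine, not something the article describes.

```python
def expected_return(stake, rtp):
    """Average amount paid back for a stake at a given return-to-player rate."""
    return stake * rtp


def expected_balance(bankroll, rtp, rounds):
    """Average balance left after re-staking everything each round.

    Assumption: the player bets the full balance every round, which is a
    deliberately crude model of "playing on".
    """
    return bankroll * rtp ** rounds


# The example from the quote: pay 100, get back 90 on average.
paid_back = expected_return(100, 0.90)
house_take = 100 - paid_back

# Compare how long the money lasts, on average, at two hypothetical RTPs.
for rtp in (0.90, 0.97):
    left = expected_balance(100, rtp, rounds=10)
    print(f"RTP {rtp:.0%}: about {left:.2f} left after 10 rounds")
```

Under this crude model a higher RTP leaves more money in play for longer, which is the "keeps people on site" argument in the quote. The average is always below the starting stake, whatever the rate.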

I highly recommend reading the full article on the DataBlog.


Thursday, August 1, 2013

The Ugly Side of Analytics - Casino Customer Loyalty

While listening to This American Life's episode "Blackjack", its Act 2 had me in the car saying, "oh no, they did not!"  The "they" is the Caesars Entertainment Corporation (the casino), and yes, they have a customer loyalty programme that they use to "attract more customers", and claim it's no different than other such programmes in industries like supermarkets, hotels, airlines or dry cleaners.

Well...there is a wee bit of difference.

No one is addicted to dry cleaning.

I am saddened that analytics is used to help the casino loyalty programme and hurt the pathological gamblers. The show indicates that the programme identifies "high value customers" using loyalty cards, tracking all spend and results, and then offers them the "right" rewards to keep them coming back. Most addicted gamblers are "high value customers". The bigger the loser, the bigger the reward. Rewards range from drinks and meals, hotel suites and trips to casinos (if you don't live there) to gifts like handbags and diamonds.

Analytics and Operational Research is supposed to be the Science of Better.

I'd like to call on all professionals in the analytics field to reflect on the moral goodness, or lack thereof, in your work.

There is still hope though. If casinos can use analytics to identify problem gamblers, then others can too. Given pathological gambling is a mental health issue, is it time for NGOs or governments to catch up with technology and get their hands on those loyalty card data?

Monday, July 29, 2013

Learn R with Coursera for Data Analysis

Heads up: the Computing for Data Analysis course is running in September 2013.

It will teach you the R language for data analysis. The course is described as:
This course is about learning the fundamental computing skills necessary for effective data analysis. You will learn to program in R and to use R for reading data, writing functions, making informative graphs, and applying modern statistical methods. 
In this course you will learn how to program in R and how to use R for effective data analysis. You will learn how to install and configure software necessary for a statistical programming environment, discuss generic programming language concepts as they are implemented in a high-level statistical language. The course covers practical issues in statistical computing which includes programming in R, reading data into R, creating informative data graphics, accessing R packages, creating R packages with documentation, writing R functions, debugging, and organizing and commenting R code. Topics in statistical data analysis and optimization will provide working examples.


Related articles:
Coursera and the Analytics Talent Gap
Starting up in Operational Research: What Programming Languages Should I Learn?

Sunday, July 28, 2013

Even Google can't get their numbers straight

Google has so many entities and products, either grown within the organisation or externally acquired. It appears that even Google, the leader in Data Science and Analytics, cannot get all the numbers straight across their products: Google Analytics vs. Blogger.

Is this blog really that popular? Really?

While I was checking this blog's traffic numbers on Blogger's built-in "Stats" function, I was really surprised that the blog seemed to be really popular, even though I have not been good (sorry!) at writing much for some time. As an ex-SEO'er, I had an inkling that something was not right. Enter Google Analytics.

Blogger Stats numbers are 4.5 times bigger than Google Analytics'.

After checking my Google Analytics (GA) numbers, I was really surprised to see that the Blogger Pageview numbers were 4.5 times bigger than the GA numbers. That is a staggering difference!

After some research on the web, I concluded that:
  1. GA is much closer to the truth (but not quite completely true, see 3 below).
  2. Blogger stats include all kinds of bot traffic, so they are heavily inflated (GA tries to filter most of it out).
  3. GA cannot count any traffic if the user has disabled JavaScript. Some folks suggest it undercounts traffic by 50%, but there is no hard evidence to back it up, so take it with a grain of salt.
  4. Blogger seems set on reporting only Pageviews, not any other useful metrics, such as Visits or Unique Visitors. Not sure why.
  5. This blog has probably been targeted by a spam bot. Upon closer look, one of the bots probably comes from a particular Dutch ISP.
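To put the 4.5x gap in perspective, here is a rough Python sketch of what it implies. It assumes, purely for illustration, that the GA number is the true human traffic (point 3 above says it is not quite), and the pageview counts are hypothetical placeholders, not this blog's real figures.

```python
def inflation(blogger_pageviews, ga_pageviews):
    """Return (ratio, share of Blogger pageviews that GA does not count)."""
    ratio = blogger_pageviews / ga_pageviews
    uncounted_share = 1 - 1 / ratio
    return ratio, uncounted_share


# Hypothetical counts chosen only to reproduce the 4.5x ratio.
ratio, uncounted = inflation(blogger_pageviews=4500, ga_pageviews=1000)
print(f"Blogger reports {ratio:.1f}x the GA pageviews")
print(f"{uncounted:.0%} of Blogger pageviews are missing from GA")
```

In other words, if GA were right, more than three quarters of the pageviews Blogger reports would be bots or other traffic GA does not see.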

Share best practice and be consistent.

I would have expected Google, the leader in Data Science and Analytics, to share best practice amongst its entities and products, such as reporting on key metrics (not just Pageviews).

I would also have expected Google to be able to have a consistent set of numbers amongst its entities and products. It doesn't appear so either.


More often than not, the majority of a Business Intelligence (BI) analyst's time is spent verifying and reconciling numbers amongst various reports. Major BI tech giants sell BI applications that often promise to reduce such activities and increase business confidence in the numbers in the data warehouse. However, it is still a major challenge for most companies, as evidenced here. Without a good and reliable data source, the validity of any subsequent analysis is heavily undermined.

Let's try to stay consistent.
That goes for the metric choice, and the numbers.


FYI: if you want to find out whether your site is being attacked by spam bots, and by whom, read this helpful post.

Saturday, April 6, 2013

7.2% raise for 1,000 best paid Ontario public sector employees




The 1,000 employees with the highest packages (salary + taxable benefits) in the Ontario Public Sector Salary Disclosure, the so-called “Sunshine List”, saw an average increase of almost $25,000 in 2012 compared to the previous year. That is an increase of 7.2%, much higher than the 2.2% seen by the bottom half of the 80,000-strong list.

Is this cause for alarm? Highly paid CEOs are fully in the public spotlight, and the many, many school principals have their pay closely monitored, but what about the highly paid individuals near, but not at, the top? The data shows that for them, 2012 was a good year.

Every year since 1996, the Ontario Ministry of Finance has released a list of all public sector employees who earned more than $100,000 in the previous year.

Oversight

We can all see that “Sunshine List” champion Thomas Mitchell, President & CEO of Ontario Power Generation, took a pay cut this year, but with close to 100,000 names on the list, more sophisticated, data-driven oversight is possible.

Government-friendly observers point out that the average salary on the list has decreased, just like last year, but that is a red herring. Anyone can add over 9,000 people earning just over $100k to a list with an average salary of $129k and bring down the average. As the list continues to grow from the bottom, we can expect the average salary to decline, without this being any indicator of public fiscal discipline.
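A quick Python sketch shows how the dilution works. The $129k average and the 9,000 newcomers come from the figures above. The existing headcount of 71,000, the $101k newcomer salary and the 2.5% raise for everyone already on the list are assumptions made only for illustration.

```python
def diluted_average(existing_count, existing_avg, new_count, new_avg):
    """Weighted average salary after newcomers join the list."""
    total = existing_count * existing_avg + new_count * new_avg
    return total / (existing_count + new_count)


old_avg = 129_000              # average salary on the list (from the post)
raised_avg = old_avg * 1.025   # assumption: everyone already listed gets 2.5% more

new_avg = diluted_average(
    existing_count=71_000,     # assumption: list size before the newcomers
    existing_avg=raised_avg,
    new_count=9_000,           # newcomers (from the post)
    new_avg=101_000,           # assumption: "just over $100k"
)

print(f"Average before: ${old_avg:,.0f}")
print(f"Average after:  ${new_avg:,.0f}")
```

Under these assumptions every person already on the list got a raise, and yet the list's average still goes down. That is why a falling average says nothing about fiscal discipline.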

Opposition partisans will lament the growth of the list, 9,000 more this year and 7,500 the year before. This is again misleading. The pyramid shape of any organisation tells us that there are more people as you move down the salary brackets. With a perfectly reasonable average salary growth of just over 2.5%, 9,600 employees graduated to the “Sunshine List” this year after having earned around $98k last year. Probably more than 9,600 employees currently earning around $98k will be new additions to the list next year, and more the year after. Inflation and economic growth will ensure that the list grows, and the pyramid shape will ensure that it grows faster.
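The "around $98k" figure can be checked with a one-line calculation. This is a minimal sketch: the function name is made up, and a flat 2.5% raise for everyone is a simplification of the "just over 2.5%" average growth mentioned above.

```python
def min_prior_salary(threshold, growth):
    """Lowest salary last year that reaches the threshold after one raise."""
    return threshold / (1 + growth)


cutoff = min_prior_salary(threshold=100_000, growth=0.025)
print(f"Anyone earning at least ${cutoff:,.2f} last year crosses $100k this year")
```

So anyone earning from roughly $97.6k upwards last year joins the list after an ordinary raise, and the pyramid shape means there are plenty of people in that band every year.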

Top 1,000

So who are these lucky 1,000 who on average made 7.2% more in 2012?

This year the top 1,000 packages on the list included:
  • 583 individuals working in hospitals
    • 176 Pathologists
    • 50 Chief Executive Officers
    • 66 Vice-Presidents (Senior, Executive, etc.)
    • 79 Psychiatrists
  • 86 employees in electricity
    • 56 Vice-Presidents (Senior, Executive, etc.)
  • 144 working at Universities
    • 100 Professors

Big raises

Of the 1,000, 737 can be matched exactly by name and organisation type to last year. 92 of those fortunate souls saw an increase of over 25%! At the top of the pack was Mohamed Abelaziz Elbestawi, Vice-President Research/Professor at McMaster University, whose reported salary paid was $266k in 2011 and $506k in 2012! Trung Kien Mai, a Pathologist at The Ottawa Hospital, saw his salary paid move from $306k in 2011 to $515k in 2012!

Of those 92 with big raises:
  • 83 work in hospitals
    • 50 are Pathologists

More questions

At this point, this analysis raises more questions than it answers, but that is to be expected from an analysis of this salary disclosure data. The Public Salary Disclosure Act can help us find questions, not answers. What we do know is that:
  • Salaries near the top grew substantially
  • Those salaries grew much more, even on a percentage basis, than those at the bottom
  • Growth was higher than expected given slow economic growth
  • Some individuals can be shown to have experienced extraordinary raises
  • Pathologists do well, and 2012 was a particularly good year for some

Source: http://www.fin.gov.on.ca/en/publications/salarydisclosure/pssd/

Timberland customer care & operations - I approve!

Buying a brand is buying quality - that's especially true for outdoor equipment.

With this belief, I purchased a pair of Timberland hiking boots that said "Waterproof" on a piece of official-looking metal attached to them. I then ended up with wet feet during an 8-day trek in Patagonia where it often rains - that sucked.

With my toes literally swimming in water within the boots, after a soppy wet day of a 19km hike, I was not a happy camper. However, my perception of Timberland took a 180-degree turn for the better.

Having bought the boots in southern Chile in a Bata store, having used them extensively and been disappointed and upset by them, I ran into a Timberland brand store 2,500km away from where I bought them, still in Chile. I went and complained about my disappointment in these supposedly "waterproof" boots, and I was offered the chance to exchange them for a brand new pair that is indeed waterproof, paying only the small price difference between the two pairs.

This is operationally remarkable:

Different stores (Bata vs Timberland)
I bought them in Bata, which is a popular international brand that happens to carry the Timberland boots. However, I was able to exchange them in a Timberland own-brand store. Given that the receipt I got from the Timberland store says "Bata" on it, I suspect the two are operated by the same company. However, for a western audience, can you imagine buying something at Gap and then returning it at Banana Republic (same parent company)?

Different cities and provinces
I don't know what it's like in the US, but in Canada, returns and exchanges wouldn't be possible across provincial borders. Yet, in this case, it was not a problem.


After the 14-day exchange period without the paper receipt
It was at least 3 weeks after the original purchase date, while the receipt stated a 14-day exchange period. I also didn't keep the paper receipt (trying to travel light), but I had a photo of it on my phone. I was able to email this to them to enable the processing. Again, can you imagine this happening in a western country?


"Waterproof" vs "Gore-Tex"
Finally, for everyone's learning: apparently, if it only says "waterproof", it's not waterproof. Only if it says "Gore-Tex" is it actually waterproof.


I went into the Timberland store only to vent my frustration. I was positively flabbergasted when they offered to exchange the boots for a new pair. Not only is the customer care commendable, but the fact that this could happen operationally is something I would never have expected. They basically went against all the rules I know of that would make this infeasible in western countries. Yet the teens who worked at the Timberland store were willing to find ways to help me, a foreigner with broken Spanish, so I would have this outstanding experience and be happy with the decently expensive pair of hiking boots. How they keep the books straight on this transaction is beyond me, 'cause surely they are running Bata and Timberland as two separate business entities.

The result: Timberland now has a new loyal customer. This is an outstanding example of great customer care made possible by some well-integrated and smooth operations.