The semantic web / NLProc world is abuzz today with the news of Google’s Knowledge Graph.
I’m thrilled and fascinated by Google’s work in this arena. They’re taking a true “web scale” approach towards knowledge extraction. My company (AlchemyAPI) has been working in this area intensely over the past year, examining large swaths of the web (billions of pages), performing language / structural analysis, and extracting reams of factual and ontological data. We’re using gathered data for different purposes than Google (we’re enhancing our semantic analysis service, AlchemyAPI — whereas Google is improving the search experience for their customers), but we are both using some analogous approaches to find and extract this sort of information.
What’s interesting to me, however, is how this is really a sort of tipping point for Google. We’re witnessing their evolution from “search engine” to “knowledge engine”, something many have expected for years — but which carries a number of consequences (intended and unintended).
Google has always maintained a careful balance of risk/reward with content owners/creators. They provide websites with referral traffic (web page hits), while performing what some may argue is wholesale copyright infringement (copying entire web pages, images, even screenshots of web pages).
This has historically worked out quite well for Google. Website owners get referral traffic, and thus can show ads, sell subscriptions, and get paid. Google copies their content (showing snippets/images/etc on Google.com properties) to make this virtuous cycle happen.
Stuff like the “Knowledge Graph” potentially torpedoes this equation. Instead of pointing users to the web page that contains the answer to their search, Google’s semantic algorithms can directly display an answer, without the user ever leaving Google.com.
Say you’re a writer for About.com, spending your time gathering factual information on your topic of choice (e.g., “Greek Philosophers”). You carefully curate your About.com page, and make money on ads shown to users who read your content (many of whom are referred from Google.com).
If Google can directly extract the “essence” of these pages (the actual entities and facts contained within), and show this information to users — what incentive do these same individuals have to visit your About.com page? And where does this leave content creators?
The risk here isn’t necessarily a legal one: there’s quite a bit of established precedent holding that “facts” cannot be easily owned or copyrighted. But sites could start blocking Google’s crawlers. No one is likely to do this anytime soon, as Google’s semantic features are only just getting started and “referral traffic” is still the biggest game in town. But what does the future hold?
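For the curious, blocking a crawler is mechanically simple. Here is a minimal sketch using Python’s standard `urllib.robotparser` module. The robots.txt contents, the `can_crawl` helper, and the example.com URL are made up for illustration; “Googlebot” is the user-agent token Google’s crawler identifies itself with.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out Googlebot, allow everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Allow: /
"""


def can_crawl(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/greek-philosophers"  # illustrative URL
    for agent in ("Googlebot", "SomeOtherBot"):
        print(agent, can_crawl(agent, page))
```

Of course, robots.txt is only a request that well-behaved crawlers honor, and a site that does this gives up its referral traffic too, which is exactly the trade-off described above.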
I’m guessing Google will work out these sorts of bumps in the road on their path towards becoming a true Knowledge Engine. But it’s an interesting point to think about.
PS: Google Squared could be argued to be an earlier “tipping point”, but it was largely an experiment. The Google Knowledge Graph represents a true, web-scale commercial effort in this arena. A real tipping point.
Here’s a fun little demo app a co-worker and I built:
This application leverages the MS Kinect to manipulate 3d visualizations of social media data. The application tracks the 3d motion of a person’s hand, using it as a virtual mouse cursor.
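To give a flavor of the “hand as mouse cursor” idea, here is a rough sketch of one way to map a tracked hand position onto screen pixels. This is not the demo’s actual code: the `InteractionBox` dimensions, the function names, and the assumption that the sensor reports hand coordinates in meters (x to the right, y upward) are all illustrative.

```python
from dataclasses import dataclass


@dataclass
class InteractionBox:
    """Hypothetical region in front of the sensor, in meters."""
    x_min: float = -0.3
    x_max: float = 0.3
    y_min: float = -0.2
    y_max: float = 0.2


def hand_to_cursor(hand_x, hand_y, box, screen_w, screen_h):
    """Map a hand position inside `box` to integer pixel coordinates."""
    # Normalize to 0..1 within the box, clamping anything outside it.
    nx = min(max((hand_x - box.x_min) / (box.x_max - box.x_min), 0.0), 1.0)
    ny = min(max((hand_y - box.y_min) / (box.y_max - box.y_min), 0.0), 1.0)
    # Sensor y grows upward, screen y grows downward, so flip the y axis.
    return round(nx * (screen_w - 1)), round((1.0 - ny) * (screen_h - 1))


def smooth(prev, new, alpha=0.5):
    """Exponential smoothing to damp jitter in the tracked position."""
    return tuple(alpha * n + (1.0 - alpha) * p for p, n in zip(prev, new))


if __name__ == "__main__":
    box = InteractionBox()
    print(hand_to_cursor(0.1, 0.05, box, 1920, 1080))
```

The smoothing step matters in practice because raw skeletal tracking tends to jitter; a lower `alpha` gives a steadier but laggier cursor.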
The social media data was mined from tens of millions of news articles and blog posts over a period of 1+ month, using natural language processing algorithms to analyze article/blog contents, identify named entities and trends, and track momentum over time.
Info on this app:
real-time 3d visualization of social media data, represented as a force-directed-graph.
social media data was mined from tens of millions of news articles and blog posts over a 1+ month period.
news / blog data analyzed using natural language processing (NLP) algorithms including: named entity extraction, keyword extraction, concept tagging, sentiment extraction.
high-performance temporal data-store enables visualization of connections between named entities (e.g., “Nicolas Sarkozy -> Francois Hollande”).
system tracks billions of data-points (persons, companies, organizations, …) for tens of millions of pieces of content.
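For anyone unfamiliar with force-directed graphs, here is a toy sketch of the general technique in 3d: every pair of nodes repels, connected nodes attract, and the layout settles so that linked entities end up near each other. This is a simplified variant in the spirit of Fruchterman-Reingold, not the demo’s implementation; the function name, parameters, and the third entity in the usage example are assumptions.

```python
import math
import random


def force_directed_layout(nodes, edges, iterations=200, k=1.0, step=0.05, seed=0):
    """Return {node: [x, y, z]} after a simple force-directed relaxation."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1.0, 1.0) for _ in range(3)] for n in nodes}

    def delta_and_dist(a, b):
        delta = [pa - pb for pa, pb in zip(pos[a], pos[b])]
        dist = max(math.sqrt(sum(d * d for d in delta)), 1e-6)
        return delta, dist

    for _ in range(iterations):
        disp = {n: [0.0, 0.0, 0.0] for n in nodes}
        # Repulsion: every pair of nodes pushes apart with force k^2 / d.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                delta, dist = delta_and_dist(a, b)
                force = k * k / dist
                for axis in range(3):
                    push = delta[axis] / dist * force
                    disp[a][axis] += push
                    disp[b][axis] -= push
        # Attraction: connected nodes pull together with force d^2 / k.
        for a, b in edges:
            delta, dist = delta_and_dist(a, b)
            force = dist * dist / k
            for axis in range(3):
                pull = delta[axis] / dist * force
                disp[a][axis] -= pull
                disp[b][axis] += pull
        # Move each node along its net force, capped at `step` per iteration.
        for n in nodes:
            length = max(math.sqrt(sum(d * d for d in disp[n])), 1e-6)
            scale = min(length, step) / length
            for axis in range(3):
                pos[n][axis] += disp[n][axis] * scale
    return pos


if __name__ == "__main__":
    names = ["Nicolas Sarkozy", "Francois Hollande", "Some Company"]
    layout = force_directed_layout(names, [(names[0], names[1])])
    for name, xyz in layout.items():
        print(name, [round(c, 2) for c in xyz])
```

The all-pairs repulsion loop is quadratic in the number of nodes, so this only works for small graphs; anything at real-time scale would need spatial partitioning or a GPU.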
This is an example of a “20% time” employee project at my company, AlchemyAPI. We do fun projects like this to spur the imagination and as a creative diversion. Other projects (which I’ll get around to posting at some point) involve speech recognition, robots, and other geektacular stuff.