The semantic web / NLProc world is abuzz today with the news of Google’s Knowledge Graph.
I’m thrilled and fascinated by Google’s work in this arena. They’re taking a true “web scale” approach towards knowledge extraction. My company (AlchemyAPI) has been working in this area intensely over the past year, examining large swaths of the web (billions of pages), performing language / structural analysis, and extracting reams of factual and ontological data. We’re using the gathered data for different purposes than Google (we’re enhancing our semantic analysis service, whereas Google is improving the search experience for their customers), but we are both using analogous approaches to find and extract this sort of information.
What’s interesting to me, however, is how this is really a sort of tipping point for Google. We’re witnessing their evolution from “search engine” to “knowledge engine”, something many have expected for years — but which carries a number of consequences (intended and unintended).
Google has always maintained a careful balance of risk/reward with content owners/creators. They provide websites with referral traffic (web page hits), while performing what some may argue is wholesale copyright infringement (copying entire web pages, images, even screenshots of web pages).
This has historically worked out quite well for Google. Website owners get referral traffic — thus can show ads, sell subscriptions, and get paid. Google copies their content (showing snippets/images/etc on Google.com properties) to make this virtuous cycle happen.
Stuff like the “Knowledge Graph” potentially torpedoes this equation. Instead of pointing users to the web page that contains the answer to their search, Google’s semantic algorithms can directly display an answer, without the user ever leaving Google.com.
Say you’re a writer for About.com — spending your time gathering factual information on your topic of choice (say, “Greek Philosophers”). You carefully curate your About.com page, and make money on ads shown to users who read your content (many of whom are referred from Google.com).
If Google can directly extract the “essence” of these pages (the actual entities and facts contained within), and show this information to users — what incentive do these same individuals have to visit your About.com page? And where does this leave content creators?
The risk here isn’t necessarily a legal one — there’s quite a bit of established precedent which states that “facts” cannot be easily owned or copyrighted. But sites could start blocking Google’s crawlers. No one is likely to do this anytime soon, as Google’s semantic features are only just getting started and “referral traffic” is still the biggest game in town. But what does the future hold?
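For the curious: blocking Google’s crawlers is mechanically trivial. A two-line robots.txt at a site’s root, using the standard Robots Exclusion Protocol, does it:

```
User-agent: Googlebot
Disallow: /
```

The hard part isn’t technical — it’s that opting out of Google’s index means opting out of its referral traffic too, which is exactly the tension described above.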
I’m guessing Google will work out these sort of bumps in the road on their path towards becoming a true Knowledge Engine. But it’s an interesting point to think about.
PS: Google Squared could be argued as an earlier “tipping point”, but was largely more of an experiment. The Google Knowledge Graph represents a true, web-scale commercial effort in this arena. A real tipping point.
Here’s a fun little demo app a co-worker and I built:
This application leverages the MS Kinect to manipulate 3d visualizations of social media data. It tracks the 3d motion of a person’s hand, using it as a virtual mouse cursor.
Social media data was mined from tens of millions of news articles and blog posts over a 1+ month period, using natural language processing algorithms to analyze article/blog contents, identify named entities and trends, and track momentum over time.
Info on this app:
real-time 3d visualization of social media data, represented as a force-directed-graph.
social media data was mined from tens of millions of news articles and blog posts over a 1+ month period.
news / blog data analyzed using natural language processing (NLP) algorithms including: named entity extraction, keyword extraction, concept tagging, sentiment extraction.
high-performance temporal data-store enables visualization of connections between named entities (eg, “Nicolas Sarkozy -> Francois Hollande”)
system tracks billions of data-points (persons, companies, organizations, …) for tens of millions of pieces of content.
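For readers unfamiliar with force-directed graphs: the layout emerges from a simple physical simulation in which all nodes repel each other and connected nodes attract. Here’s a minimal, illustrative sketch of that idea (not the actual demo code, and the entity names are just stand-ins from the example above):

```python
import math
import random

def force_directed_layout(nodes, edges, iterations=200, k=1.0):
    """Basic Fruchterman-Reingold-style layout in 2d:
    every pair of nodes repels; nodes joined by an edge attract."""
    random.seed(42)  # deterministic starting positions for the sketch
    pos = {n: [random.uniform(-1, 1), random.uniform(-1, 1)] for n in nodes}
    for _ in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}
        # repulsion between every pair of nodes
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                d = math.hypot(dx, dy) or 1e-9
                f = k * k / d  # repulsive force magnitude
                disp[a][0] += f * dx / d; disp[a][1] += f * dy / d
                disp[b][0] -= f * dx / d; disp[b][1] -= f * dy / d
        # attraction along edges
        for a, b in edges:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            d = math.hypot(dx, dy) or 1e-9
            f = d * d / k  # attractive force magnitude
            disp[a][0] -= f * dx / d; disp[a][1] -= f * dy / d
            disp[b][0] += f * dx / d; disp[b][1] += f * dy / d
        # move each node, capping displacement so the layout stays stable
        for n in nodes:
            dx, dy = disp[n]
            d = math.hypot(dx, dy) or 1e-9
            move = min(d, 0.1)
            pos[n][0] += dx / d * move
            pos[n][1] += dy / d * move
    return pos

entities = ["Nicolas Sarkozy", "Francois Hollande", "Unrelated Corp"]
links = [("Nicolas Sarkozy", "Francois Hollande")]
layout = force_directed_layout(entities, links)
```

After the simulation settles, connected entities sit near each other and unconnected ones drift apart — which is why clusters of related people and companies pop out visually in the app.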
This is an example “20% time” employee project at my company, AlchemyAPI. We do fun projects like this to spur the imagination and as a creative diversion. Other projects (which I’ll get around to posting at some point) involve speech recognition, robots, and other geektacular stuff.
I’ve always been fascinated by data — both of the companies I’ve founded have addressed aspects of the “data overload” problem. The first, MimeStar, developed NIDS (Network Intrusion Detection System) technology that analyzed gigabits of network traffic every second, reconstructing every IP packet, TCP session, and application-layer protocol stream — looking for computer intrusions and other inappropriate activity. MimeStar was acquired in early 2000 and our products are still protecting government and corporate networks 10 years later. NIDS is fascinating technology, reducing massive packet flows down to intelligible event/activity streams & security alerts.
My present company builds natural language processing (computational linguistics) technology to make sense of the huge quantities of unstructured text residing across the web and within company data warehouses. We’re helping build the semantic web, by “bootstrapping” unstructured content into a form that is understandable by machines. NLP is an exciting space, with real disruption potential. It’s becoming a critical technology for Semantic & Web 3.0 applications/services.
What’s that? You haven’t heard of the Semantic Web? Check out this fantastic video, created by Kate Ray of NYU. Her short documentary does a great job of summing up many of the drivers behind the Semantic Web (such as data overload), and touches upon many of the future applications of this technology.
If disruptive innovation, artificial intelligence, and Web 3.0 are your bread-and-butter, AlchemyAPI is currently hiring. We’re based in Denver, CO and are growing rapidly. Join our team and help build the next generation of semantic technology!
APML, a new standardized format for expressing attention preferences, has been receiving a lot of buzz in recent weeks. Mashable covers the topic here, Brad Feld here, Jeff Nolan here, and Read/WriteWeb here.
It’s great to see an increasing number of folks getting behind the concept of ‘standardized structured attention’ and embracing this emerging standard.
Attention has always been a topic of interest to me, something I’ve blogged about in the past, on a number of occasions. At my company Orchestr8, we’ve been working on solutions that can automatically capture the ‘context’ of a user’s attention and leverage this data in various ways. We’re currently implementing APML support into the next version of our software, which should provide for some really interesting capabilities.
The thing that excites me about APML is that it’s a relatively straightforward standard (far, far simpler than the many RSS/ATOM variants). This will ease adoption and simplify portability of attention preference data across many products / services. Since APML expresses attention in a relatively abstract way, multiple products (even product domains, for instance Web versus Email) can leverage the same attention data.
Additional tech note: Thank you, APML authors, for strictly standardizing the date format in the APML spec (ISO 8601). If only we could have been so lucky with RSS/ATOM. Now let’s hope people actually stick to the date formats!
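To show why a single mandated format matters: with ISO 8601, one strict parser suffices, whereas real-world RSS/ATOM feeds force you to try a pile of date formats (RFC 822 variants, ISO-ish strings, and worse). A small sketch, assuming Python’s standard datetime handling (the helper name and sample timestamps are mine, not from the APML spec):

```python
from datetime import datetime

ISO_8601 = "%Y-%m-%dT%H:%M:%S%z"  # the one strict format to support

def parse_apml_date(s: str) -> datetime:
    """Parse an ISO 8601 timestamp; no format-guessing heuristics needed."""
    if s.endswith("Z"):            # normalize the 'Z' (UTC) suffix for strptime
        s = s[:-1] + "+0000"
    return datetime.strptime(s, ISO_8601)

ts = parse_apml_date("2007-11-28T14:30:00Z")
```

Contrast that with the RSS world, where robust feed readers ship a list of fallback formats and still encounter dates none of them match.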
“While the original vision of the semantic web is grandiose and inspiring in practice it has been difficult to achieve because of the engineering, scientific and business challenges. The lack of specific and simple consumer focus makes it mostly an academic exercise.”
The post touches upon some of the practical issues keeping semantic technology out of the hands of end-users, and potential ways around these roadblocks. Summaries are given for three top-down “mechanisms” that may provide workarounds to some issues:
Leveraging Existing Information
Using Simple / Vertical Semantics
Creating Pragmatic / Consumer-Centric Apps.
I couldn’t agree more with the underlying principle of this post: top-down approaches are necessary in order to expose end-users to semantic search & discovery (at least in the near-term).
However, this isn’t to say that there isn’t value in bottom-up semantic web technologies like RDF, OWL, etc. On the contrary, these technologies can provide extremely high quality data, such as categorization information. In the past year, there’s been significant growth in the amount of bottom-up data that’s available. This includes things like the RDF conversion of Wikipedia structured data (DBpedia), the US Census, and other sources. Indeed, the “W3C Linking Open Data” project is working on interlinking these various bottom-up sources, further increasing their value for semantic web applications. What’s the point of all this data collection/linking? “It’s all about enabling connections and network effects.”
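To make the “enabling connections” point concrete, here’s a toy sketch in plain Python (no actual RDF tooling) of how interlinking bottom-up sources multiplies their value: facts are (subject, predicate, object) triples, and independent datasets join on shared identifiers. The specific triples are illustrative stand-ins, not real DBpedia or Census records:

```python
# Toy triple store: each fact is a (subject, predicate, object) triple,
# mimicking how RDF datasets expose structured knowledge.
dbpedia_like = [
    ("Denver", "type", "City"),
    ("Denver", "locatedIn", "Colorado"),
]
census_like = [
    ("Denver", "population", "600158"),  # illustrative figure
]

def query(triples, subject=None, predicate=None):
    """Return the triples matching the given subject/predicate pattern."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)]

# Interlinking: because both sources use the same identifier ("Denver"),
# the merged graph answers questions neither source could alone.
linked = dbpedia_like + census_like
```

A query like `query(linked, subject="Denver")` now spans both datasets — that cross-source joining is the network effect the Linking Open Data project is after.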
My personal feeling is that neither a bottom-up nor a top-down approach will attain complete “success” in facilitating the semantic web. Top-down approaches are good enough for some applications, but sometimes generate dirty results (incorrect categorizations, etc.). Bottom-up approaches can generate some incredible results when operating within a limited domain, but can’t deal with messy data. What’s needed is to bridge the gap between the two modes: leveraging top-down approaches for an initial, dirty classification, and incorporating cleaner bottom-up sources when they’re available.
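In code, that hybrid looks roughly like this — prefer a clean bottom-up fact when one exists, fall back to a dirty top-down heuristic otherwise. A minimal sketch with made-up data; the curated entries and the heuristic are placeholders, not any real classifier:

```python
def make_classifier(bottom_up, top_down):
    """Bridge-the-gap sketch: clean curated facts win; a heuristic
    classifier covers everything the curated sources miss."""
    def classify(entity):
        if entity in bottom_up:              # curated, DBpedia-style source
            return bottom_up[entity], "bottom-up"
        return top_down(entity), "top-down"  # dirty statistical fallback
    return classify

# Hypothetical curated categories (assumed data, not a real dataset)
curated = {"Plato": "Philosopher", "Athens": "City"}

# Crude top-down heuristic: pattern-based guess, sometimes wrong
def heuristic(entity):
    return "Organization" if entity.endswith("Inc") else "Unknown"

classify = make_classifier(curated, heuristic)
```

The interesting engineering lives in the seams: deciding when a bottom-up source is authoritative enough to override the heuristic, and feeding corrections back in — which is where the agent-based mashups below come in.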
So how do we bridge the gap? Here’s what I’m betting on: Process-oriented, or agent-based mashups. These sit between the top-down and bottom-up stacks, filtering/merging/sorting/classifying information. More on this soon.