On one of my daily Google searches for interesting bits and bytes, I came across a useful research project by Steven Skiena of Stony Brook. Skiena, a geneticist and mathematician, is the creator of a suite of algorithmic tools: TextMap, TextMed, TextBlg and TextBiz.
TextMap is a kind of dynamic concept map generated from recent news stories. By spidering the Web daily, it identifies certain kinds of entities (including people, cities, countries, companies, universities, drugs, websites and titles) and represents both the spatial and the temporal distance between them. The site analyzes more than 1,000 domestic and international news sources each day, using natural language processing and advanced statistics to generate the content. The back-end magic can be licensed by contacting the prof.
TextMap analyzes recent press to show how rivals Google (yellow) and Microsoft (green) divide the U.S.
This project is appealing. Some user interface shortcomings aside, TextMap has the potential to be an effective research tool for discovering how people and ideas fit together. For example, our own Chris Soghoian, who recently got a little press after exposing a privacy concern on Facebook, is associated through the news with Christopher White (TSA) and Matthew Blaze (cryptology prof at Penn). A dual-color bar graph shows their relevance in the news and to each other. You can click on these bars to pull up a list of recent articles in which both entities appear. As a first step of discovery, exploring this network view of the entities may be more useful than the content view that Google and other search engines provide.
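One simple way to picture the pairing behind those bars is plain co-occurrence counting: tally how many articles mention both entities. The sketch below is only an illustration under that assumption, not TextMap's actual method; the function name `cooccurrence_counts` and the placeholder entities "A", "B" and "C" are made up for the example.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(articles):
    """Count how many articles mention each unordered pair of entities.

    `articles` is a list of entity-name lists, one list per article
    (a hypothetical input format chosen for this sketch).
    """
    counts = Counter()
    for entities in articles:
        # set() ignores repeat mentions within one article;
        # sorted() gives every pair one canonical ordering.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts

# Placeholder data: three articles and the entities each one mentions.
articles = [
    ["A", "B", "C"],
    ["A", "C"],
    ["A", "B", "B"],
]

counts = cooccurrence_counts(articles)
for pair, n in counts.most_common():
    print(pair, n)
```

Pairs with higher counts would be drawn as more strongly related. A real system would presumably also normalize for how often each entity appears on its own.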
TextMap offers other analyses, too. The site shows news storms, interaction graphs between two entities, and heat maps showing the frequency of reference in the news across the U.S. The inclusion of spatial relationships tying news to location, coupled with the limited range of time, helps the context of news stories bubble up in a way it cannot when all results are listed in some cumulative ranking. TextMap also reports media sentiment, a measure of how well-regarded something is in comparison to other entities. Sentiment is expressed in terms of polarity, a value either above or below a neutral position, and subjectivity, where high scores reflect great passion and low scores indicate apathy.
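To make polarity and subjectivity concrete, here is a minimal sketch built on one common formulation: polarity as the balance of positive over negative mentions, and subjectivity as the share of all mentions that carry any sentiment at all. These formulas, the function name `sentiment_scores`, and the sample counts are assumptions for illustration. The site's exact scoring is not described here.

```python
def sentiment_scores(positive, negative, total_mentions):
    """Return (polarity, subjectivity) for one entity.

    Assumed definitions for this sketch:
      polarity     = (positive - negative) / (positive + negative), in [-1, 1]
      subjectivity = (positive + negative) / total_mentions, in [0, 1]
    """
    opinionated = positive + negative
    # With no opinionated mentions, treat the entity as neutral.
    polarity = (positive - negative) / opinionated if opinionated else 0.0
    subjectivity = opinionated / total_mentions if total_mentions else 0.0
    return polarity, subjectivity

# Made-up counts: 30 positive and 10 negative mentions out of 200 total.
polarity, subjectivity = sentiment_scores(positive=30, negative=10, total_mentions=200)
print(f"polarity={polarity:.2f}, subjectivity={subjectivity:.2f}")
```

Under these definitions, a polarity above zero sits on the favorable side of neutral, and a subjectivity near zero means the press mentions the entity mostly without opinion.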
The other three Skiena projects are interesting as well. TextMed does for medical abstracts what TextMap does for current events, discovering the relationships between biological entities. Similarly, TextBlg shows entity relationships from the blogosphere. TextBiz uses a random walk to predict future prices for NASDAQ, NYSE, and AMEX stocks. Some of the reports are not generated frequently, but the ideas behind these sites are well worth following as the tools evolve.
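For readers unfamiliar with the term, a random walk treats each price change as an independent random step. The sketch below simulates the simplest textbook version, with equal-sized steps up or down at even odds. It is not TextBiz's model, whose details are not given here; the function name, step size and starting price are all hypothetical.

```python
import random

def simulate_random_walk(start_price, steps, step_size=1.0, seed=None):
    """Simulate a simple symmetric random walk starting at `start_price`.

    Each step moves the price up or down by `step_size` with equal
    probability. Returns the full path, including the starting price.
    """
    rng = random.Random(seed)  # a local generator keeps runs reproducible
    prices = [start_price]
    for _ in range(steps):
        move = step_size if rng.random() < 0.5 else -step_size
        prices.append(prices[-1] + move)
    return prices

# Hypothetical example: ten steps from a starting price of 100.
path = simulate_random_walk(100.0, steps=10, seed=42)
print(path)
```

Running many such simulations and looking at the spread of end points gives a rough distribution of where a price might land, which is the general idea behind random-walk forecasts.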