Yet another attempt at live-blogging an academic talk … will try to clean it up later.
Evangelos Milios received a diploma in Electrical Engineering from the National Technical University of Athens, Greece, in 1980 and Master's and Ph.D. degrees in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology, Cambridge, Massachusetts, graduating in 1986. While at M.I.T., he was a member of the Digital Signal Processing group and worked on acoustic signal interpretation problems at the M.I.T. Lincoln Laboratory. After spending five years doing research on shape analysis and sensor-based mobile robotics in the Department of Computer Science, University of Toronto, he joined York University in 1991 as an Associate Professor.
Since July 1998 he has been with the Faculty of Computer Science, Dalhousie University, Halifax, Nova Scotia, where he is Professor and Killam Chair of Computer Science. He was Director of the Graduate Program (1999-2002). He is a Senior Member of the IEEE. He served as a member of the ACM Dissertation Award committee (1990-1992) and of the AAAI/SIGART Doctoral Consortium Committee (1997-2001), and he is Co-Editor-in-Chief of the journal Computational Intelligence. He has published on the processing, interpretation and use of visual and range signals for landmark-based navigation and map construction in single- and multi-agent robotics. His current research activity is centered on modeling and mining of the content and link structure of Networked Information Spaces.
Modern document resources are often interlinked, and possibly in more than one way. The citation graph is a typical example, in which papers are linked via references and citations, and also via co-authorship relationships. Several problems have been defined on such organically grown networked information spaces. Modeling such spaces has given rise to various plausible evolution models with interesting theoretical properties, but no tools for verifying whether the models are good approximations of the real graphs. We are developing a rich summary representation of graphs based on degree cores, and we demonstrate that many of the models are rather poor approximations of real graphs. In a different project, we are developing probabilistic approaches to document co-clustering, by extending the standard Latent Dirichlet Allocation model to capture not only correlations between words, but also correlations between topics. Current research is directed towards combining content and link structure in clustering and co-clustering in networked information spaces.
Networked Information Spaces
* text information is everywhere, and it is networked (WWW, blogs, scientific literature, patents, corporate networks, common law, social networking, e-mail networks)
* volume grows rapidly; networks grow large and are dynamic
Managing text information
* info retrieval is not enough (do more than retrieve info)
* make text collections browsable … create connections based on content, summarize, cluster & classify
* understand network structure, dynamics (terminology, connections)
* Dynamics: evolution of scientific fields, researcher communities … find pockets of unusual activity
1) degree-core representation of large graphs (network)
2) automatic topic extraction from text collections (text)
How to represent large graphs? (they are difficult to visualize and compare; we want to summarize graph structure)
degree core:
* the k-core is the maximal subgraph H of graph G = (V,E) in which every vertex has degree ≥ k within H
* obtained by progressively removing all nodes with degree < k (k-core decomposition); see the sketch after this list
* summarize by creating a tree structure showing decomposition
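To make the peeling step concrete, here is a minimal sketch of computing a k-core by repeated removal of low-degree vertices. This is my own toy illustration (graph, function name and all), not the speaker's implementation, and it stops short of the tree summary described in the talk:

```python
# Minimal sketch of the k-core peeling step (toy illustration):
# repeatedly remove vertices whose surviving degree is < k.
def k_core(adj, k):
    """Vertices in the k-core of an undirected graph.

    adj: dict mapping vertex -> set of neighbouring vertices.
    """
    alive = set(adj)
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            # count only neighbours that are still alive
            if sum(1 for u in adj[v] if u in alive) < k:
                alive.discard(v)
                changed = True
    return alive

# toy graph: a triangle {0, 1, 2} with a pendant vertex 3 hanging off vertex 2
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(k_core(adj, 2))   # {0, 1, 2}: the pendant vertex is peeled away
```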
How well do the standard models (Erdős-Rényi, Linearized Chord Diagram, Chung and Lu deletion/partial duplication) apply to real-world graphs? (see the sketch after these bullets)
* Real-World data: omit "shrubs" of the tree (shallow-depth components) for clarity
* Real-World data: number of vertices decreases smoothly
* Erdős-Rényi: linear structure of tree, vertices decline slowly, no fluctuation in components (after drop)
* Linearized Chord Diagram: linear structure of tree, vertices decline slowly, no fluctuation in components (after drop) … nearly identical to E-R
* Chung and Lu partial duplication: linear structure of tree, vertices decline regularly, no fluctuation in components (after drop) … vertex decline is more like the real-world graphs
* E-mail graph (real-world): matched by all models (it, too, fails to split into multiple components), and its size distribution mostly matches partial duplication
* none of the other real-world graphs is matched by any model (the complexity of the data is not explained by random models)
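To make that comparison concrete, here is a hedged sketch of profiling the k-core decomposition of two synthetic models with networkx. I don't have the talk's real citation and e-mail graphs, so an Erdős-Rényi graph and a preferential-attachment graph stand in:

```python
# Hedged sketch: k-core "profile" (surviving vertices and components per k)
# for two synthetic graph models. The talk compared such profiles against
# real graphs; these generators are stand-ins, not the talk's data sets.
import networkx as nx

def core_profile(G):
    """Return (k, #vertices, #components) for each level of the decomposition."""
    cores = nx.core_number(G)
    profile = []
    for k in range(1, max(cores.values()) + 1):
        H = nx.k_core(G, k=k, core_number=cores)
        profile.append((k, H.number_of_nodes(),
                        nx.number_connected_components(H)))
    return profile

er = nx.gnp_random_graph(1000, 0.01, seed=0)     # Erdos-Renyi
ba = nx.barabasi_albert_graph(1000, 5, seed=0)   # preferential attachment
for name, G in (("ER", er), ("BA", ba)):
    print(name, core_profile(G))
```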
Suggestion to grow large graphs by starting with summarization, rather than the other way around
This is a potential analytic tool for identifying dynamics of real-world networks that are hidden by the construction assumptions of random models
Automatic topic extraction from text collections
* model-based overlapping co-clustering
* latent Dirichlet co-clustering (plain LDA is sketched after this list)
* extension of previous work
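For context, a minimal sketch of standard LDA topic extraction with scikit-learn; this is the baseline that latent Dirichlet co-clustering extends, not the extended model itself, and the tiny corpus is my own stand-in:

```python
# Minimal sketch of standard LDA topic extraction (the baseline model,
# not the talk's latent Dirichlet co-clustering). Toy corpus of my own.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "robot navigation with range sensors and landmarks",
    "citation graphs link papers through references",
    "topic models cluster words in document collections",
    "mobile robots build maps from sensor readings",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)                  # document-term counts

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)                   # per-document topic mixtures

terms = vectorizer.get_feature_names_out()
for t, weights in enumerate(lda.components_):
    top_terms = weights.argsort()[::-1][:5]         # highest-weight words
    print(f"topic {t}:", ", ".join(terms[i] for i in top_terms))
```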
personal digital library
* search extractions: keyterm extraction, key-sentence / summarization, metadata, taxonomies (hierarchical clustering; see the sketch after this list)
* make digital library more useful by using all search tools
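One plausible reading of the "taxonomies (hierarchical clustering)" bullet, sketched under my own assumptions (toy documents, TF-IDF features, average-link clustering; none of these details come from the talk):

```python
# Hedged sketch: a small taxonomy built by agglomerative clustering of
# TF-IDF document vectors. Documents, features and linkage choice are my
# assumptions, not details from the talk.
from scipy.cluster.hierarchy import linkage
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "mobile robot navigation and mapping",
    "sensor-based localization for robots",
    "latent topic models for text collections",
    "probabilistic document clustering",
]

X = TfidfVectorizer().fit_transform(docs).toarray()
Z = linkage(X, method="average", metric="cosine")   # cluster-merge tree
print(Z)  # each row merges two clusters; scipy's dendrogram() can plot it
```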
text representation
* vector space model … the typical method (sketched after this list)
* high dimensionality … feature selection, transformation
* linguistic properties of text … synonymy, polysemy; no other semantic link between words
* word order in text … no sequential link between words
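A hedged sketch of that standard vector-space (TF-IDF) representation, with toy documents of my own, just to make the limitations above concrete:

```python
# Sketch of the standard TF-IDF vector space model (toy documents).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "papers cite other papers in the citation graph",
    "authors write papers and form co-authorship links",
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)          # rows = documents, columns = terms

print(X.shape)                       # dimensionality grows with the vocabulary
print(vec.get_feature_names_out())
# Note the limitations flagged above: "cite" and "citation" are unrelated
# dimensions (no synonymy handling), and word order is discarded entirely.
```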
Co-Clustering
* given a multi-dimensional data matrix … simultaneous clustering along its dimensions
* two-dimensional matrix … find subsets of rows exhibiting similar behavior across subsets of columns
* goal: exploit duality between row, column clustering ... co-clustering and dimension reduction
* non-overlapping co-clusters
* mixture models and co-clustering (row and column clusters jointly define blocks); see the sketch below
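A sketch of the basic co-clustering idea using scikit-learn's SpectralCoclustering. This is a different algorithm from the model-based overlapping method described in the talk; it is shown only to illustrate simultaneous row/column clustering on a toy matrix:

```python
# Sketch of simultaneous row/column clustering (spectral co-clustering),
# not the talk's model-based overlapping co-clustering. Toy data matrix.
import numpy as np
from sklearn.cluster import SpectralCoclustering

# toy document-term count matrix (rows = documents, columns = terms)
X = np.array([
    [5, 4, 0, 0],
    [4, 5, 1, 0],
    [0, 1, 6, 5],
    [0, 0, 5, 6],
])

model = SpectralCoclustering(n_clusters=2, random_state=0)
model.fit(X)

print("row clusters:   ", model.row_labels_)      # document clusters
print("column clusters:", model.column_labels_)   # term clusters
```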
experiments with the MovieLens and Reuters data sets
* compare: model-based overlapping clustering … K-Means algorithm … information-theoretic co-clustering
* some promise that this approach can be useful
doing some work on clustering of Wikipedia data
not quite "ready for prime time yet" but these ideas from 15 years ago are now approaching being practical