SGI Analyses Full Text Content Of English Edition Of Wikipedia

by Biztech2.com Staff 21st June, 2012 in Business Intelligence

   

SGI, the trusted leader in technical computing, has partnered with Kalev H. Leetaru of the University of Illinois to create the first-ever historical mapping and exploration of the full text contents of the English-language edition of Wikipedia, in time and space. The results include visualisations of modern history, produced in under a day using in-memory data-mining techniques. By loading the entire English-language edition of Wikipedia into an SGI UV 2000, Leetaru was able to show how Wikipedia's view of the world unfolded over the past two centuries. Each reference is tagged with its location, year and positive or negative sentiment.

Several previous projects have mapped Wikipedia entries using location metadata manually assigned by editors, but those attempts accounted for only a tiny fraction of Wikipedia's location information. This project unlocked the contents of the articles themselves, identifying every location and date in all four million pages, and the connections among them, to create a massive network.

“This analysis allows the world to take a step back from the individual articles and text to gain a forest view of the tremendous knowledge captured in Wikipedia, not just a page by page tree view. We can watch how one of the largest collections of human knowledge has evolved and see what we could never see before, such as global sentiment at a certain time and place, or where there might be blind spots in the knowledge coverage,” said Franz Aman, chief marketing officer and head of strategy, SGI. “We love to use Google Earth because we can zoom out and get the big picture view.  With SGI UV 2, we can apply the same concept to Big Data to get the big picture on our Big Data.”

The analysis shows four periods of growth in Wikipedia's historical coverage: 1001-1500 (Middle Ages), 1501-1729 (Early Modern Period), 1730-2003 (Age of Enlightenment) and 2004-2011 (Wikipedia Era). Its continued growth appears to be focused on enhancing its coverage of historical events rather than on documenting more of the present. The average tone of Wikipedia's coverage of each year closely matches major global events, with the most negative period in the last 1,000 years being the American Civil War, followed by World War II. The analysis also shows that the "copyright gap" that blanks out most of the twentieth century in digitised print collections is not a problem for Wikipedia, whose coverage shows steady exponential growth from 1924 to today.

“The one-way nature of connections in Wikipedia, the lack of links, and the uneven distribution of Infoboxes, all point to the limitations of metadata-based data mining of collections like Wikipedia,” said Leetaru.  “With SGI UV 2, the large shared memory available allowed me to ask questions of the entire dataset in near-real time. With a huge amount of cache-coherent shared memory at my fingertips, I could simply write a few lines of code and run it across the entire dataset, asking whatever questions came to mind.  This isn’t possible with a scale-out computing approach.  It’s very similar to using a word processor instead of using a typewriter – I can conduct my research in a completely different way, focusing on the outcomes, not the algorithms.”

Loaded into the SGI UV 2000, the Big Brain computer, this massive dataset underwent full-text geo-coding and complete date-coding, using algorithms that identified every mention of every location and every date across the text of every entry on Wikipedia. More than 80 million locations and 42 million dates between 1000 AD and 2012 were extracted, averaging 19 locations and 11 dates per article (one every 44 words and one every 75 words, respectively). The connections between every date and every location were captured in a massive network representing Wikipedia's view of history. With this instrumentation, Leetaru was able to perform near-real-time analysis over the entire dataset on the SGI UV 2, creating visual maps through space and time that show not only how history unfolded but also the overall tone of the world throughout the last thousand years. He was also able to test a wide array of theories and research questions interactively, all in less than a day's work.
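The idea of linking every date to every location can be illustrated with a minimal sketch. This is not the project's code: the tiny gazetteer, the regular expressions, the function names and the choice to connect dates and places that appear in the same article are all assumptions made for illustration, and a real geo-coder would need a far larger place-name database and disambiguation logic.

```python
import re
from collections import Counter

# Hypothetical toy gazetteer (assumption); real geo-coding needs a full place-name database.
GAZETTEER = {"Paris", "London", "Gettysburg", "Berlin"}

# Four-digit numbers that could be years; the 1000-2012 range is enforced below.
YEAR_RE = re.compile(r"\b(1\d{3}|20[01]\d)\b")
# Capitalised single words, used as naive place-name candidates.
WORD_RE = re.compile(r"[A-Z][a-z]+")


def extract_years(text):
    """Return every year between 1000 and 2012 mentioned in the text."""
    return [int(y) for y in YEAR_RE.findall(text) if 1000 <= int(y) <= 2012]


def extract_locations(text, gazetteer=GAZETTEER):
    """Return every capitalised word that appears in the gazetteer."""
    return [w for w in WORD_RE.findall(text) if w in gazetteer]


def build_network(articles):
    """Count, for each (year, place) pair, the articles mentioning both."""
    edges = Counter()
    for text in articles:
        years = set(extract_years(text))
        places = set(extract_locations(text))
        for year in years:
            for place in places:
                edges[(year, place)] += 1
    return edges


articles = [
    "The Battle of Gettysburg was fought in 1863.",
    "In 1863 London opened its first underground railway; Paris followed in 1900.",
]
network = build_network(articles)
```

Each key in `network` is a (year, place) edge and each value is the number of articles in which the pair co-occurs. Holding such a structure for four million pages in one address space is the kind of workload the article attributes to large shared memory.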

The SGI UV 2 product family enables users to find answers to the world's most difficult problems on a system as easy to administer as a workstation. Built on the Intel Xeon processor E5 family, running standard Linux, and supporting a wide range of storage options, SGI UV 2 offers a complete, industry-standard solution for no-limit computing.

With as few as 16 cores and 32 gigabytes of memory, SGI UV 2 can start small and seamlessly expand. This next-generation platform doubles the number of cores (up to 4096 cores) and quadruples the amount of coherent main memory (up to 64 terabytes) from the previous generation, available for in-memory computing in a single-image system. SGI UV 2 can scale to eight petabytes of shared memory. At a peak I/O rate of four terabytes per second (14 PB/hour), it could ingest the entire contents of the U.S. Library of Congress print collection in less than three seconds.
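The throughput figures can be checked with simple arithmetic. The sketch below assumes decimal units (1 PB = 1000 TB); the variable names are illustrative only.

```python
TB_PER_SECOND = 4        # peak I/O rate quoted in the article
SECONDS_PER_HOUR = 3600
TB_PER_PB = 1000         # decimal units (assumption)

# Hourly throughput implied by the per-second rate.
pb_per_hour = TB_PER_SECOND * SECONDS_PER_HOUR / TB_PER_PB

# Largest collection that could be ingested in three seconds at that rate.
max_tb_in_three_seconds = TB_PER_SECOND * 3
```

This gives 14.4 PB per hour, consistent with the quoted 14 PB/hour, and implies that the print collection is being taken as smaller than 12 TB.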
 

Tags: SGI, Wikipedia, Data Mining, Big Data, Analysis

   
