SGI Analyses Full Text Content Of English Edition Of Wikipedia
While several previous projects have mapped Wikipedia entries using location metadata manually assigned by editors, those attempts accounted for only a tiny fraction of Wikipedia’s location information. This project unlocked the contents of the articles themselves, identifying every location and date in all four million pages, and the connections among them, to create a massive network.
“This analysis allows the world to take a step back from the individual articles and text to gain a forest view of the tremendous knowledge captured in Wikipedia, not just a page by page tree view. We can watch how one of the largest collections of human knowledge has evolved and see what we could never see before, such as global sentiment at a certain time and place, or where there might be blind spots in the knowledge coverage,” said Franz Aman, chief marketing officer and head of strategy, SGI. “We love to use Google Earth because we can zoom out and get the big picture view. With SGI UV 2, we can apply the same concept to Big Data to get the big picture on our Big Data.”
From this analysis, Wikipedia is seen to have four periods of growth in its historical coverage: 1001-1500 (Middle Ages), 1501-1729 (Early Modern Period), 1730-2003 (Age of Enlightenment) and 2004-2011 (Wikipedia Era). Its continued growth appears to be focused on enhancing its coverage of historical events rather than on documenting more of the present. The average tone of Wikipedia’s coverage of each year closely matches major global events, with the most negative period in the last 1,000 years being the American Civil War, followed by World War II. The analysis also shows that the “copyright gap” that blanks out most of the twentieth century in digitised print collections is not a problem with Wikipedia, where there is steady exponential growth in coverage from 1924 to today.
“The one-way nature of connections in Wikipedia, the lack of links, and the uneven distribution of Infoboxes, all point to the limitations of metadata-based data mining of collections like Wikipedia,” said Leetaru. “With SGI UV 2, the large shared memory available allowed me to ask questions of the entire dataset in near-real time. With a huge amount of cache-coherent shared memory at my fingertips, I could simply write a few lines of code and run it across the entire dataset, asking whatever questions came to mind. This isn’t possible with a scale-out computing approach. It’s very similar to using a word processor instead of using a typewriter – I can conduct my research in a completely different way, focusing on the outcomes, not the algorithms.”
Loaded into the SGI UV 2000, the Big Brain computer, this massive dataset underwent full text geo-coding and complete date-coding, using algorithms that identified every mention of every location and every date across the text of every entry on Wikipedia. More than 80 million locations and 42 million dates between 1000 AD and 2012 were extracted, averaging 19 locations and 11 dates per article (one every 44 words and one every 75 words, respectively). The connections between every date and every location were captured in a massive network representing Wikipedia’s view of history. With this instrumentation, Leetaru was able to perform near-real time analysis over the entire dataset on the SGI UV 2, creating visual maps across space and time that show not only how history unfolded but also the overall tone of the world throughout the last thousand years, and interactively testing a wide array of theories and research questions, all in less than a day’s work.
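To make the idea of a date-location network concrete, here is a minimal Python sketch. It is not the project's actual algorithm, which the article does not describe. It assumes a tiny hypothetical gazetteer (GAZETTEER), treats only four-digit years from 1000 to 2012 as dates, and links every year to every location that appears in the same article. All function names and the sample sentences are invented for illustration.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical toy gazetteer (an assumption for illustration); a real
# geocoder would use a far larger list and disambiguate place names.
GAZETTEER = {"Paris", "London", "Gettysburg"}

# Simplification: only four-digit years from 1000 to 2012 count as dates.
YEAR_RE = re.compile(r"\b(1\d{3}|200\d|201[0-2])\b")


def extract_years(text):
    """Return the distinct years mentioned in the text, sorted."""
    return sorted({int(y) for y in YEAR_RE.findall(text)})


def extract_locations(text):
    """Return the distinct gazetteer names mentioned in the text, sorted."""
    return sorted(
        name for name in GAZETTEER
        if re.search(r"\b" + re.escape(name) + r"\b", text)
    )


def build_network(articles):
    """Count, for each (year, location) pair, the articles mentioning both."""
    edges = Counter()
    for text in articles:
        for pair in product(extract_years(text), extract_locations(text)):
            edges[pair] += 1
    return edges


# Invented sample sentences standing in for article text.
articles = [
    "The Battle of Gettysburg was fought in 1863.",
    "In 1863 the first underground railway opened in London.",
    "Paris hosted a world's fair in 1889 and again in 1900.",
]
network = build_network(articles)
for (year, place), count in sorted(network.items()):
    print(year, place, count)
```

Each key in the resulting counter is one edge of the network. Grouping the edges by year or by place gives the kind of per-year or per-location view that the maps described above would be drawn from.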
The SGI UV 2 product family enables users to find answers to the world’s most difficult problems on a system as easy to administer as a workstation. Built with the Intel Xeon processor E5 family, running standard Linux, and supporting a wide range of storage options, SGI UV 2 offers a complete, industry-standard solution for no-limit computing.
With as few as 16 cores and 32 gigabytes of memory, SGI UV 2 can start small and expand seamlessly. This next-generation platform doubles the number of cores (up to 4096) and quadruples the amount of coherent main memory (up to 64 terabytes) compared with the previous generation, available for in-memory computing in a single-image system. SGI UV 2 can scale to eight petabytes of shared memory, and at a peak I/O rate of four terabytes per second (14 PB/hour) it could ingest the entire contents of the U.S. Library of Congress print collection in less than three seconds.
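The throughput figures can be checked with simple arithmetic. The Python sketch below assumes decimal units (1 PB = 1000 TB), which the article does not specify, and the function names are illustrative only.

```python
TB_PER_SECOND = 4          # peak I/O rate quoted in the article
SECONDS_PER_HOUR = 3600
TB_PER_PB = 1000           # assumption: decimal units


def pb_per_hour(tb_per_second):
    """Convert a rate in terabytes per second to petabytes per hour."""
    return tb_per_second * SECONDS_PER_HOUR / TB_PER_PB


def tb_ingested(tb_per_second, seconds):
    """Terabytes that can be ingested in the given number of seconds."""
    return tb_per_second * seconds


print(pb_per_hour(TB_PER_SECOND))     # 14.4
print(tb_ingested(TB_PER_SECOND, 3))  # 12
```

At four terabytes per second the rate works out to 14.4 PB per hour, which the article rounds to 14. The three-second claim implies the print collection is being taken as smaller than about 12 TB, a size the article itself does not state.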
posted by GLaDOS
22nd June, 2012 6:50am