VAST 2007 Preprocessed Data

Georges Grinstein, Curran Kelleher, Chris Deveau
Institute for Visualization and Perception Research
University of Massachusetts at Lowell
(March 7th, 2007)

NOTE: All other texts were processed on a UNIX-based machine. You may need to
change your text editor's settings to treat the included files as UNIX text
files instead of DOS text files.

Also, the accuracy of the named entity tags cannot be 100% guaranteed. We are
very confident, though, that the tagging system we use is fairly accurate.
Just be on the lookout for strange tags. We used MITRE’s Alembic with some
modifications and hand work.
See http://www.mitre.org/tech/alembic-workbench/ for details on Alembic.

Added to the original data set described in
"Analysts' Brief and Contest Instructions v3.doc" are:

BlogText
--------
This folder includes text rips from the two blogs included in the data set.
The text rips have also been tagged.

News_Tagged
-----------
Tagged news articles. These files include tags, which are described in more
depth below.

.txt.p.NAMED_ENTITY
    These are tab-delimited tables of named entity mentions.

    format:
    [???] [paragraph ID] [mention ID] [entity ID] [entity type] [numeric value] [string] [char offset] [string length]

    [???]
        there are always three question marks at the beginning of every line;
        they carry no information
    [paragraph ID]
        the ID of the paragraph in which this mention occurred
    [mention ID]
        the ID of this particular mention
    [entity ID]
        the ID of the entity itself, within the document only, not unique
        across all documents
    [entity type]
        a three-character string describing the type of entity:
        DAT = date, LOC = location, MON = monetary value,
        ORG = organization, PER = person, TIM = time
    [numeric value]
        the numeric value associated with this entity, or empty if there is
        no associated numeric value; for example, this field would be 30000
        for the entity whose string is "$ 30,000"
    [string]
        the string of this entity mention, for example "Christine Gregoire"
    [char offset]
        the character offset of this mention
    [string length]
        the length of the string of this entity mention

.txt.p.NE
    These are the articles with all of the entities tagged in the MUC SGML
    format. The paragraphs are tagged as well, and assigned IDs.

.txt.p.NORMALIZED_ENTITY
    These are tab-delimited tables of named entities, aggregated across
    mentions.

    format:
    [???] [entity ID] [entity type] [additional information] [string]

    [additional information]
        usually this is blank, but it is filled in some cases; for example,
        it will be CORP if Alembic has determined the entity to be a
        corporation

.txt.p.SYN
    These are the articles with syntax structures tagged.

News_Text
---------
The news text in a more readable ASCII text format (no XML tags).

Entities_*.txt
--------------
These files are CSV files of all the entities our system found in the news
articles and blogs we processed. DAT is for dates. MON is for money. ORG is
for organizations. TIM is for time references. And finally, person is for
person tags. AllEntities.txt has all of the above data in one single file.

The files are set up in this format:
Name Tag Type Number of files in List of files tag is located in

VastembicProcessing
-------------------
Contains a jar executable along with the source for our file indexer.
Instructions below.

Index the data first:
    File -> Build index -> Select the "news-original" directory
Indexing should take about 30 seconds.
Now it is ready to use; type search terms and hit 'enter' or press the
'Search!' button.
Double-clicking on a result will open that text file in a text window, with
the search term highlighted in the text.

WARNING
-------
ONLY THE TEXT WAS PROCESSED, so REMEMBER TO LOOK AT THE IMAGES AS WELL and add
the corresponding metadata if needed.

Any questions: mail to Grinstein@cs.uml.edu
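
Example: reading a .txt.p.NAMED_ENTITY file
-------------------------------------------
The .txt.p.NAMED_ENTITY format described above can be read with a few lines
of Python. This is only a minimal sketch, not part of the distributed tools.
It assumes the nine tab-separated fields appear in exactly the order listed
in the format line, and that char offset and string length are integers. The
names Mention and parse_mentions are made up for this illustration, and the
IDs and offsets in the sample lines are invented; only the two example
strings come from the description above.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional

FIELDS_PER_LINE = 9  # assumed: [???] plus the eight documented fields


@dataclass
class Mention:
    paragraph_id: str
    mention_id: str
    entity_id: str             # unique within one document only
    entity_type: str           # DAT, LOC, MON, ORG, PER or TIM
    numeric_value: Optional[float]
    text: str
    char_offset: Optional[int]
    length: Optional[int]


def _to_number(raw: str, cast):
    """Convert a field with `cast`; return None if it is empty or malformed."""
    raw = raw.strip()
    if not raw:
        return None
    try:
        return cast(raw)
    except ValueError:
        return None


def parse_mentions(lines: Iterable[str]) -> Iterator[Mention]:
    """Yield one Mention per well-formed line, skipping all other lines."""
    for line in lines:
        # Strip both UNIX and DOS line endings, but keep empty tab fields.
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != FIELDS_PER_LINE:
            continue
        _marker, par, men, ent, etype, num, text, off, length = fields
        yield Mention(par, men, ent, etype,
                      _to_number(num, float), text,
                      _to_number(off, int), _to_number(length, int))


if __name__ == "__main__":
    # Invented sample lines; real files will have different IDs and offsets.
    sample = [
        "???\t1\t3\t2\tMON\t30000\t$ 30,000\t120\t8\n",
        "???\t1\t4\t5\tPER\t\tChristine Gregoire\t140\t18\r\n",
    ]
    for mention in parse_mentions(sample):
        print(mention.entity_type, repr(mention.text), mention.numeric_value)
```

To read a real file, pass an open file object to parse_mentions. Since entity
IDs are unique only within a document, keep the file name alongside each
entity ID when combining mentions from several files.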