MARCO GSRC T2 Fabrics: Bookshelf

New Data Formats For VLSI CAD

	Andrew B. Kahng, Igor Markov
	and GSRC participants

0.	Slides (ppt, ps) of the talk by Igor Markov at the UCLA workshop (Nov 20-21, 1999)
I.	Introduction
II.	Why and When Use New Data Formats
III.	Motivation and Main Goals
IV.	Gotchas
V.	General Guidelines/Standards
VI.	Open Issues
VII.	Availability Status of New Data Formats
VIII.	Resources

Appendix A. Note To Developers

I. Introduction

The MARCO/GSRC bookshelf aims at collecting leading-edge implementations for VLSI CAD algorithms as well as providing efficient mechanisms of evaluating such implementations and comparing them against each other. Standard benchmarks play an important role in such comparisons. Several research groups participating in the bookshelf effort are now converging on common representations for a number of fundamental optimization problems in VLSI CAD. Such representations include semantics (i.e., what kind of information is given), abstract syntax (i.e., how the information is organized to facilitate common use models) and concrete syntax (i.e., specific serial representations). At this point, we are not standardizing in-memory representations, but implementation options may implicitly influence the choice of common representations.

The new data formats will fill gaps where no public data formats existed before and will also improve on existing but deficient formats. They will be followed by publicly available benchmarks and simple utilities (e.g., converters, statistics browsers, evaluators and constraint verifiers).

Bookshelf maintainers (in the future, the steering committee) actively interact with the authors of the new formats in order to ensure relevance, domain coverage, user convenience and uniform look of all new formats. This document is designed to summarize our motivations and general guidelines in addition to listing the formats that are already available and can be used as examples.

II. Why And When Use New Data Formats (see separate page)

III. Motivation and Main Goals

Provide high-quality data formats to capture leading-edge optimization problems in different areas of VLSI CAD and make them easy to use both in academic and industrial settings.
Ensure that all data formats are ASCII-based and maximally human-readable.
Put independent groups of attributes into separate files and connect such files with several-line "glue files" to ensure maximal lifespan of format components even when other components are not used.
Attempt genericity and reuse by design, i.e., construct single- and multi-file formats so that they can be used for multiple, possibly unexpected, purposes.
Avoid arbitrary restrictions.
Attempt to simplify the task of writing parsers through common data format practices and through offering solutions to standard problems (e.g., implementation of hash tables).
Pay attention to detail, including careful choice among equivalent syntactic constructs (e.g., one-liners versus begin/end), identifiers and default values. We will try to prevent confusion and misuse.
Describe the desired parser behavior, especially error diagnostics, when this appears critical.
Consider use scenarios for proposed file formats and ensure that the general cases result in only minimal overhead in degenerate cases (e.g., careful defaults versus specifying nonexistent details as "None").
Pursue more fundamental issues first, ensuring modularity, reuse and extensibility. We attempt to follow the analogy to the C and C++ programming languages where the main functionalities reside in standard libraries and do not affect the complexity of the [core of the] language.

IV. Gotchas

Post your plans for writing new data formats, as well as early drafts of your format descriptions, to bookshelf-devel. This helps avoid overlap with the work of others and detects possible misfeatures in your formats as early as possible. Think of Nicolas Bourbaki (i.e., define concepts in clear and context-independent terms, preferably relying only on numbers, shapes and sets as fundamental concepts).
Put some thought into modeling your particular domain with generic mathematical constructs. Not only does this result in clearer semantics and syntax, but it also enables unexpected reuse.
Make sure you look at available format descriptions and try to reuse as much as reasonable.
The hardest issue in writing a parser is good error diagnostics. To help the parser, modularize your data formats and annotate types.
Explain what characters can be used in names in your format and whether particular names are prohibited (e.g., you cannot have a variable called "for" in C because this is a reserved word). Clearly state your case-[in]sensitivity requirements.
Avoid redundant numerical information that is not useful for parsing and error diagnostics (e.g., the shape of a hard block and the area).
Avoid cryptic and confusing identifiers/declarators. In our environment, a good data format will be clear to a specialist without a manual.
Carefully balance between verbose and laconic identifiers/declarators. It is typically a good idea to use words instead of numbers when choosing one of several options (e.g., relative/absolute rather than 0/1).
When saving files in a particular format, generously pad variable-length fields with whitespace so that your files look like tables. This dramatically improves human-readability.
When you require that several pieces of information be on the same line, make sure that everything fits, including possible generous whitespace padding in someone else's code.
Post sample instances as early as possible, but when doing so, clearly say whether the format is subject to change.
Publish a reference parser, either in source code or in binary for major platforms, to prevent others from producing non-standard instances. Ensure good error diagnostics in the reference parser (having good error diagnostics in any parser is justified, e.g., for debugging the parser on standard benchmarks ;-).
Grammar-based parsers, e.g., the ones that use lex/yacc (or flex/bison), often have very poor error diagnostics. Do not be afraid of writing a parser in C++; it is very far from anathema, and some good parsers are written in C++ from the ground up (e.g., the SGML parser SP).
When writing your parser in C++, do not forget about numerous string-processing functions in the C standard library, such as strstr (type 'man strstr' on any Unix system, or better see the 2nd edition of the Kernighan & Ritchie C book for a tour of stdlibc). User-defined C++ I/O manipulators appear very useful (see Stroustrup's 3rd edition or, better, "Ruminations on C++" by Koenig and Moo), especially for lookaheads.
When processing arbitrary names, you will most likely need an implementation of hash tables. While the standard C and C++ libraries mysteriously avoid hashing functions for character strings, the 3rd edition of Stroustrup's C++ book proposes an interface to a hash-based container, hash_map. This container is implemented in the SGI STL and included with g++ 2.95 and higher.
PERL is a very strong candidate for converters that do not populate in-memory representations. It will be interesting to see a PERL-based parser that populates in-memory representations in C++ (contact imarkov@cs.ucla.edu if you are thinking of writing such a parser).
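The advice above on padding variable-length fields can be illustrated with a minimal sketch, written in Python for brevity. The function name format_table, the gap parameter and the sample records are assumptions made for this illustration and are not part of any bookshelf format.

```python
def format_table(rows, gap=2):
    """Return text lines in which every column is padded to its widest entry.

    rows: list of rows, each a list of values (converted with str()).
    gap:  number of spaces between adjacent columns (an arbitrary choice).
    """
    cells = [[str(value) for value in row] for row in rows]
    if not cells:
        return []
    ncols = max(len(row) for row in cells)
    widths = [0] * ncols
    for row in cells:
        for i, cell in enumerate(row):
            widths[i] = max(widths[i], len(cell))
    lines = []
    for row in cells:
        padded = [cell.ljust(widths[i]) for i, cell in enumerate(row)]
        # Drop the padding after the last field so lines have no trailing blanks.
        lines.append((" " * gap).join(padded).rstrip())
    return lines


# Hypothetical records: name, width, height, placement mode.
records = [
    ["cell_a", 12, 8, "absolute"],
    ["clock_buffer_17", 4.5, 8, "relative"],
]
table = format_table(records)
for line in table:
    print(line)
```

Because blank characters only separate tokens, a parser that splits on runs of spaces and tabs reads the padded file exactly as it would read an unpadded one.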

V. General Guidelines/Standards

This is a list of general guidelines and standards that apply to all bookshelf formats, unless otherwise explicitly noted. Included here are definitions of legal characters in names, the format for numbers, definitions of "whitespace", etc.
This page can be referred to in descriptions of file formats via
http://vlsicad.cs.ucla.edu/GSRC/bookshelf/formats/#V
and does not have to be duplicated.

Conventions

blank characters (whitespace) are spaces and tabs.
multiple blank characters are equivalent to one
a colon (:) must always be preceded and followed by a space
the pound sign (#) denotes commented out lines and is only guaranteed to be processed correctly if all characters earlier on the line are blank characters
names may contain upper- and lower-case characters, as well as {"_", ":", "-" and "|"}. Names are case-sensitive! Keywords are not.
Every file has a "standard" header containing format name and revision (e.g., UCLA nets 1.0) in free format, the date and time of creation, the user who created the file (on OSes that support users, such as Windows 98/NT and Unix) and/or the software that created the file. This information can be embedded in comments at the beginning of the file and is ignored by the parser.
when the tokens expected to be found on the line are successfully parsed, all characters until the end of the line should be ignored (this allows for easy extensions); a one-time warning must be issued by the parser if any non-blank characters were ignored, "Non-blank characters are ignored until the end of line XXX and, possibly, later".
if the line-end character is encountered before all tokens are parsed, a fatal error message "Unexpected line end on line XXX" should be generated.
all quantities with dimensions, such as locations, offsets, sizes, weights, etc., can be specified as doubles. It is possible to use integers, but we felt that restricting to integers may be too risky since, e.g., LEFDEF specifies doubles. We believe that resolving 3.000000 vs 2.99999 (which may arise if a particular program expects an integer) is not difficult via rounding to integers and checking with a reasonably small round-off tolerance. This should be done by the programs that save doubles (i.e., test whether a number to be saved is epsilon-close to an integer and, if it is, round it to that integer before saving). Physical design tools that use integers internally should read doubles and check for overflow.
The format for glue files and a discussion of platform-[in]dependence are given to encourage everyone to use the same .aux file format.
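Several of the conventions above (blank characters, comment lines, ignored trailing characters, unexpected line ends and near-integer doubles) can be sketched together as follows. This is an illustration in Python under stated assumptions, not a reference parser: the names LineReader, FormatError and snap_to_integer, the tolerance of 1e-6 and the sample keyword NumNodes are all hypothetical.

```python
import re
import sys


class FormatError(Exception):
    """Fatal error in an input file (hypothetical name)."""


class LineReader:
    """Splits lines into tokens following the conventions listed above."""

    def __init__(self):
        self.warnings = []  # holds at most one message: the warning is one-time

    def read(self, line, lineno, expected):
        """Return the first `expected` tokens, or None for a blank or comment line."""
        text = line.rstrip("\r\n")
        # Blank characters are spaces and tabs; a run of them acts as one.
        tokens = [t for t in re.split(r"[ \t]+", text) if t]
        # A '#' preceded only by blank characters starts a comment line.
        if not tokens or tokens[0].startswith("#"):
            return None
        if len(tokens) < expected:
            raise FormatError(f"Unexpected line end on line {lineno}")
        if len(tokens) > expected and not self.warnings:
            message = (f"Non-blank characters are ignored until the end of "
                       f"line {lineno} and, possibly, later")
            self.warnings.append(message)
            print(message, file=sys.stderr)
        return tokens[:expected]


def snap_to_integer(value, tolerance=1e-6):
    """Before saving, replace a double that is within `tolerance` of an integer."""
    nearest = round(value)
    if abs(value - nearest) <= tolerance:
        return float(nearest)
    return value


reader = LineReader()
header = reader.read("NumNodes : 3\n", 1, 3)         # colon is its own token
comment = reader.read("   # a comment line", 2, 3)   # None
first = reader.read("a 10 20 extra", 3, 3)           # warns once about "extra"
second = reader.read("b 30 40 more", 4, 3)           # no second warning
try:
    reader.read("c 50", 5, 3)
except FormatError as error:
    print(error, file=sys.stderr)
```

Since a colon must be surrounded by spaces, splitting on blank characters yields it as a separate token, so names that themselves contain ":" remain unambiguous.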

VI. Open Issues

s during conversion to floats.

Separating problem instances from evaluator information. Simple and explicit specification of evaluators without standard in-memory representations is rarely possible. While references to descriptions in the literature can be too ambiguous to be used in file formats, they can be used wherever instances of given file formats are used (also, by reference).

We also attempt to express constraints separately from other information when such constraints cannot be explicitly formulated.

Conventions on units

Component reuse in different formats and cross-references.
This issue arises both at the level of whole files
- ... potential others, feel free to suggest to abk@cs.ucla.edu, imarkov@cs.ucla.edu
and at the level of data types within files
- orientation of standard cells in placement formats and blocks in block packing formats
- ... potential others, feel free to suggest to abk@cs.ucla.edu, imarkov@cs.ucla.edu
Pollution of common namespace (preventing several different .pin formats and preventing the use of the same declarators in different formats with confusing syntax/semantics).

VII. Availability Status of New Data Formats (see separate page)

VIII. Resources

XML resources

Appendix A. Note To Developers

Active bookshelf developers are strongly advised to request membership in the bookshelf group at GSRC with developer privileges and use the bookshelf-devel mailing list to post their early format drafts and announce implementation plans. Please do not post questions before browsing archives. For implementation, porting, installation and configuration issues, consider requesting membership in the softdevel group at GSRC and using their mailing list.
