WORKSHOP ON WORKLOAD CHARACTERIZATION
Westin Park Central | Dallas, Texas | November 29, 1998
Held in Conjunction with Micro-31: The 31st Annual ACM/IEEE International Symposium on Microarchitecture
Advance Program
A book containing most of these papers has been published by the IEEE Computer Society under the title "Workload Characterization: Methodology and Case Studies", IEEE CS Press, 1999.
Session 1: Contemporary Workloads I: Java and Graphics (8am-9:30am)
Chair: Lizy Kurian John
"A Study of Code Reuse and Sharing Characteristics of Java Applications", Marie T. Conte, Andrew R. Trick, John C. Gyllenhaal, and Wen-mei W. Hwu (Computer and System Research Lab, University of Illinois at Urbana-Champaign) Abstract
"Characterization of Java Workloads by Principal Components Analysis and Indirect Branches", Kingsum Chow, Adam Wright, and Konrad Lai (Microcomputer Research Labs, Intel Corporation) Abstract Full Paper
"Generation of 3D Graphics Workload for System Performance Analysis", Ali Poursepanj, and David Christie (CPU and System Architecture Group, AMD Corporation) Abstract Full Paper
"Analysis of Pro/ENGINEER V19 Bench98 Benchmark on a Platform Based on the Pentium II Xeon Processor and the 82440GX AGPset", Raed Kanjo (Intel Corporation) Abstract Full Paper
"Parameter Value Characterization of Windows NT-based Applications", John Kalamatianos, Ronnie Chaiken*, and David Kaeli (Dept. of ECE, Northeastern University, and Microsoft Research*) Abstract Full Paper
Session 2: I/O and Memory (10am - 11am)
Chair: Ann Marie Grizzaffi Maynard
"Application Environments & I/O Workload Characterization for Today and Tomorrow", Todd Boyd, and Renato Recio (IBM Corporation) Abstract Full Paper
"Self-similarity in I/O workload: analysis and modeling", Maria E. Lopez, and Vicente Santonja (Departamento de Ingeniería de Sistemas, Computadores y Automática (DISCA), Universidad Politécnica de Valencia (UPV)) Abstract Full Paper
"Memory Access Pattern Analysis", Mary D. Brown, Roy M. Jenevein, and Nasr Ullah (System Performance and Modeling, Motorola Inc) Abstract Full Paper
"Characterizing Instruction Latency for Speculative Issue SMPs: A Case Study of Varying Memory System Performance on the SPLASH-2 Benchmarks", Brian Grayson, and Craig Chase (Dept. of ECE, UT Austin) Abstract Full Paper
11am Keynote: Yale Patt
Session 3: Contemporary Workloads II: Data Mining and Web Servers (1pm-2pm)
Chair: Pradip Bose
"Performance and Memory-Access Characterization of Data Mining Applications", Jeffrey P. Bradford, and Jose Fortes (Dept. of Electrical and Computer Engineering, Purdue University) Abstract Full Paper
"Memory Characterization of a Parallel Data Mining Workload", Jin-Soo Kim, Xiaohon Qin, and Yarsun Hsu (IBM T.J. Watson Research Center) Abstract Full Paper
"Characterizing Response Times of WWW Caching Proxy Servers", Cristina Duarte Murta, and Virgilio A. F. Almeida (Computer Science Department, Federal University of Minas Gerais, Brazil) Abstract Full Paper
"Characterizing the Behavior of Windows NT Web Server Workloads Using Processor Performance Counters", Ramesh Radhakrishnan, and Freeman L. Rawson* (UT Austin, and IBM Austin Research Laboratory*) Abstract Full Paper
Session 4: Measurement Methodology (2pm-2:30pm)
Chair: Pradip Bose
"Trace Sampling for Desktop Applications on Windows NT", Patrick J. Crowley, and Jean-Loup Baer (Dept. of Computer Science and Engineering, University of Washington) Abstract Full Paper
"Instruction-level Characterization of Scientific Computing Application using Hardware Performance Counters", Yong Luo, and Kirk W. Cameron (Scientific Computing Group, Los Alamos National Laboratory) Abstract Full Paper
Abstracts
Session 1: Contemporary Workloads I: Java and Graphics
A Study of Code Reuse and Sharing Characteristics of Java Applications
Marie T. Conte {mconte@crhc.uiuc.edu}, Andrew R. Trick {atrick@crhc.uiuc.edu}, John C. Gyllenhaal {gyllen@crhc.uiuc.edu}, and Wen-mei W. Hwu {hwu@crhc.uiuc.edu} (Computer and System Research Lab, University of Illinois at Urbana-Champaign)
Abstract: This paper presents a detailed characterization of Java application and applet workloads in terms of reuse and sharing of Java code at the program, class, and method level. In order to expose more sharing opportunities, techniques for detecting code equivalence even in the presence of minor code changes or constant pool index differences are also proposed and examined. The analyzed application workload consists of the recently released SPECjvm98 benchmarks and the applet workload is derived from three extensive searches of the Internet between May 1997 and May 1998 using an enhanced web crawler. Analysis of these workloads reveals several new code sharing and optimization opportunities.
Characterization of Java Workloads by Principal Components Analysis and Indirect Branches
Kingsum Chow {kingsum.chow@intel.com}, Adam Wright, and Konrad Lai (Microcomputer Research Labs, Intel Corporation)
Abstract: This paper compares emerging Java workloads (e.g. VolanoMark, SysmarkJ, SpecJVM98, and Jmark 2) with non-Java workloads (e.g. FSPEC95, ISPEC95/98, and Sysmark32/98) through the use of various multivariate data analysis techniques on data collected from about one thousand traces on Pentium(R) Pro systems. Among the counters measured, the most significant difference between Java and non-Java workloads is the density of indirect branches. Upon closer inspection, it was determined that the branching behavior of most Java workloads is no worse than that of a few poorly behaved ISPEC95 benchmarks such as gcc and perl. This paper shows the effectiveness of Principal Components Analysis in screening and categorizing workload statistics, as well as some interesting patterns in the indirect branches of Java workloads.
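The screening technique named in this abstract can be sketched in a few lines; the counter names and values below are purely illustrative and not from the paper's dataset. Principal components analysis standardizes each hardware-counter statistic and projects each trace onto the directions of greatest variance, so traces with similar counter profiles cluster together:

```python
import numpy as np

def pca(X, k=2):
    """Project rows of X (one workload trace per row, one hardware
    counter statistic per column) onto the top-k principal components."""
    # Standardize each counter so no single metric dominates the variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # SVD of the standardized data yields the principal directions.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:k].T               # coordinates of each trace
    explained = (S ** 2) / np.sum(S ** 2)  # variance explained per component
    return scores, explained[:k]

# Illustrative data: 6 traces x 3 counters (IPC, branch density,
# indirect-branch density); the last 3 rows mimic Java-like traces
# with a higher density of indirect branches.
X = np.array([[0.90, 0.15, 0.010],
              [0.80, 0.14, 0.012],
              [0.85, 0.16, 0.011],
              [0.60, 0.18, 0.040],
              [0.55, 0.19, 0.045],
              [0.65, 0.17, 0.042]])
scores, explained = pca(X, k=2)
```

On data like this, the first component separates the two groups, which is the screening effect the abstract describes: one or two components summarize most of the variance across hundreds of raw counters.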
Generation of 3D Graphics Workload for System Performance Analysis
Ali Poursepanj {ali.poursepanj@amd.com}, and David Christie {david.christie@amd.com} (CPU and System Architecture Group, AMD Corporation)
Abstract: Generation of representative workloads for system performance models has been a challenge for PC system architects who are using trace driven models. Unlike processor performance models that typically only use a single CPU instruction trace, system models in most cases require traces of CPU, Advanced Graphics Port (AGP), PCI, and other bus mastering devices that can access memory. A common approach is to collect bus traces with a logic analyzer. Although this allows generation of realistic traces, typical analyzer buffer sizes seriously limit the length of contiguous traces. Another problem is that traces collected in a specific system configuration may not be representative of other systems, especially future systems with different timings and/or bus protocols. This paper presents an overview of an approach that can be used to generate long bus traces for performance model stimulus. We describe methods for characterization of system behavior and generation of accurate synthetic graphics traces based on real traces, and give examples of correlated CPU and AGP traces that are synthetic but reflect the characteristics of real CPU/AGP traces.
Analysis of Pro/ENGINEER V19 Bench98 Benchmark on a Platform Based on the Pentium(R) II Xeon(TM) Processor and the 82440GX AGPset
Raed Kanjo {raed.kanjo@intel.com} (Intel Corporation)
Abstract: This paper characterizes the behavior of the latest Pro/ENGINEER V19 benchmark, Bench98(TM), on an IA-32 workstation platform based on the recently-introduced Pentium(R) II Xeon(TM) processor and 82440GX AGPset. The paper investigates the sensitivity of the benchmark to the size and/or speed of various platform components and analyzes the floating point, branching, caching, and memory behavior of the benchmark, comparing the results to the measurements obtained using SPEC95.
Parameter Value Characterization of Windows NT-based Applications
John Kalamatianos, Ronnie Chaiken*, and David Kaeli {kaeli@ece.neu.edu} (Dept. of ECE, Northeastern University, and Microsoft Research*)
Abstract: Compiler optimizations such as code specialization and partial evaluation can effectively exploit identifiable invariance in variable values. Value profiling can identify invariant variables that the compiler misses at compile time. In this paper we focus on the invariance of procedure parameters for a set of desktop applications run on MS Windows NT 4.0. Most of these applications are non-scientific and execute interactively through a rich GUI. Given the dynamic nature of this workload, one would expect parameter values to exhibit equally unpredictable behavior. Our work addresses this question by measuring the invariance and temporal locality of parameter values. We also measure the invariance of parameter values for four benchmarks from the SPECINT95 suite for comparison.
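A common invariance metric in value profiling, which may be close to what is measured here, is the fraction of dynamic calls on which a parameter holds its single most frequent value. The sketch below is a minimal illustration under that assumption; the procedure and values are hypothetical:

```python
from collections import Counter

def invariance(values):
    """Invariance of one procedure parameter: the fraction of dynamic
    calls on which the parameter held its single most common value."""
    counts = Counter(values)
    return max(counts.values()) / len(values)

# Hypothetical value profile: the observed value of one parameter
# across 10 dynamic calls to some procedure.
profile = [4096, 4096, 4096, 4096, 4096, 4096, 4096, 4096, 512, 64]
inv = invariance(profile)  # 0.8 -> a strong candidate for specialization
```

A parameter with invariance near 1.0 is a candidate for code specialization: the compiler can emit a fast path for the dominant value and fall back to general code otherwise.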
Session 2: I/O and Memory
Application Environments & I/O Workload Characterization for Today and Tomorrow
Todd Boyd {wtboyd@us.ibm.com}, and Renato Recio {recio@us.ibm.com} (IBM Corporation)
Abstract: The design and development of future I/O subsystems need to keep pace with the rapid rate of improvement in microprocessor technology and changes in system structure. In order to analyze the potential bottlenecks of I/O subsystems we must first identify and characterize the various workloads that will run on these future systems. This paper has two major goals. The first is to identify and analyze the application environments that are presently being implemented throughout the computing industry. The second is to identify and summarize the I/O subsystem characteristics of various present-day and future workloads that typify these application environments.
Self-similarity in I/O workload: analysis and modeling
Maria E. Lopez {megomez@disca.upv.es}, and Vicente Santonja {visan@disca.upv.es} (Departamento de Ingeniería de Sistemas, Computadores y Automática (DISCA), Universidad Politécnica de Valencia (UPV))
Abstract: Recently the notion of self-similarity has been applied to wide-area and local-area network traffic. This paper demonstrates that disk-level I/O requests are self-similar in nature. We show visual and mathematical evidence that I/O accesses are consistent with self-similarity. Moreover, we show that this property of I/O accesses is mainly due to writes. For our experiments, we use two sets of traces that collect the disk activity of two systems over a period of two months. Such behavior has serious implications for the performance evaluation of storage subsystem designs and implementations, since commonly-used simplifying assumptions about workload characteristics (e.g., Poisson arrivals) are shown to be incorrect. Using the ON/OFF model, we implement a disk request generator whose inputs are the measured properties of the available trace data. We analyze the synthesized workload and confirm that it exhibits the correct self-similar behavior.
Memory Access Pattern Analysis
Mary D. Brown {mdb@umich.edu}, Roy M. Jenevein {jenevein@ibmoto.com}, and Nasr Ullah {nasr_ullah@email.sps.mot.com} (System Performance and Modeling, Motorola Inc)
Abstract: A methodology for analyzing memory behavior has been developed for the purpose of evaluating memory system design. MPAT, a memory pattern analysis tool, has been used to profile memory transactions of dynamic instruction traces. First, the memory model and means of gathering performance metrics are discussed. Then the metrics are evaluated in order to measure the utilization of the memory system and determine what changes should be made to improve memory system performance.
Characterizing Instruction Latency for Speculative Issue SMPs: A Case Study of Varying Memory System Performance on the SPLASH-2 Benchmarks
Brian Grayson {bgrayson@ece.utexas.edu}, and Craig Chase (Dept. of ECE, UT Austin)
Abstract: Out-of-order, speculative, superscalar processors are complex. The behavior of multiprocessor systems that use such processors is not well understood and very difficult to predict. We tackle this problem using a powerful simulator, Armadillo, and a novel characterization framework that breaks the instruction pipeline into five meta-stages. The Armadillo simulator models symmetric multiprocessors (SMPs) constructed from highly aggressive superscalar processors on a shared bus, and provides accurate, detailed statistics on numerous aspects of the simulated system, including the amount of time each instruction spends in each of the five meta-stages. We also analyze the fraction of each instruction's lifetime during which it remains speculative and the amount of time an instruction spends on the critical path. To demonstrate the effectiveness of this approach, we apply the characterization to applications from the SPLASH-2 benchmark suite and evaluate their sensitivity to key memory system parameters: bus frequency, bus width, memory latency, and cache latency.
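The accounting behind a meta-stage breakdown like the one described here can be sketched as follows. The five stage names below are placeholders, not the paper's meta-stages, and the timestamps are invented; the point is only the bookkeeping: record a cycle timestamp at each stage boundary per instruction, then average the per-stage deltas over the trace.

```python
# Placeholder stage names; the paper defines its own five meta-stages.
STAGES = ["fetch", "decode", "issue", "execute", "retire"]

def decompose(timestamps):
    """Given one instruction's cycle timestamps at each meta-stage
    boundary (len == len(STAGES) + 1), return cycles spent per stage."""
    return {s: t1 - t0
            for s, t0, t1 in zip(STAGES, timestamps, timestamps[1:])}

def average_per_stage(instructions):
    """Average cycles per meta-stage over a set of instructions."""
    totals = {s: 0 for s in STAGES}
    for ts in instructions:
        for s, c in decompose(ts).items():
            totals[s] += c
    n = len(instructions)
    return {s: totals[s] / n for s in STAGES}

# Two illustrative instructions; the second stalls in "issue".
insns = [[0, 1, 2, 4, 7, 8],
         [1, 2, 3, 9, 12, 13]]
avg = average_per_stage(insns)  # avg["issue"] == 4.0
```

Aggregating this way shows which meta-stage dominates instruction lifetime as memory system parameters vary, which is the kind of sensitivity question the abstract poses.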
Session 3: Contemporary Workloads II: Data Mining and Web Servers
Performance and Memory-Access Characterization of Data Mining Applications
Jeffrey P. Bradford {jeffrey.bradford@computer.org}, and Jose Fortes {fortes@ecn.purdue.edu} (Dept. of Electrical and Computer Engineering, Purdue University)
Abstract: This paper characterizes the performance and memory-access behavior of a decision tree induction program, a previously unstudied application used in data mining and knowledge discovery in databases. Performance is studied via RSIM, an execution driven simulator, for three uniprocessor models that exploit Instruction Level Parallelism (ILP) to varying degrees. Several properties of the program are noted. Out-of-order dispatch and multiple issue provide a significant performance advantage: 50%-250% improvement in IPC for out-of-order versus in-order, and 5%-120% improvement in IPC for four-way issue versus single issue. Multiple issue provides a greater performance improvement for larger L2 cache sizes, when the program is limited by CPU performance; out-of-order dispatch provides a greater performance improvement for smaller L2 cache sizes. The program has a very small instruction footprint: an 8-kB L1 instruction cache is sufficient to bring the instruction miss rate below 0.1. A small (8 kB) L1 data cache is sufficient to capture most of the locality, resulting in L1 miss rates between 10%-20%. Increasing the size of the L2 data cache does not significantly improve performance until a significant fraction (over 1/4) of the dataset fits into the L2 cache. In addition, a procedure is developed for scaling the cache sizes when using scaled-down datasets, allowing the results for smaller datasets to be used to predict the performance of full-sized datasets.
Memory Characterization of a Parallel Data Mining Workload
Jin-Soo Kim {jinsoo@watson.ibm.com}, Xiaohon Qin {xqin@watson.ibm.com}, and Yarsun Hsu {hsu@watson.ibm.com} (IBM T.J. Watson Research Center)
Abstract: This paper studies a representative of an important class of emerging applications, a parallel data mining workload. The application, extracted from the IBM Intelligent Miner, identifies groups of records that are mathematically similar based on a neural network model called the self-organizing map. We examine and compare in detail two implementations of the application along four dimensions: (1) temporal locality, or working set sizes; (2) spatial locality and memory block utilization; (3) communication characteristics and scalability; and (4) TLB performance. First, we find that the working set hierarchy of the application is governed by two parameters, namely the size of an input record and the size of the prototype array; it is independent of the number of input records. Second, the application shows good spatial locality, with the implementation optimized for sparse data sets having slightly worse spatial locality. Third, due to the batch update scheme, the application generates very little communication. Finally, a 2-way set-associative TLB may result in severely skewed TLB performance in a multiprocessor environment, caused by a large discrepancy in the number of conflict misses. Increasing the set associativity is more effective in mitigating the problem than increasing the TLB size.
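The associativity effect in the last abstract can be illustrated with a toy TLB model; the sizes and access pattern below are illustrative, not the paper's configuration. Pages whose numbers collide in the same set evict each other in a low-associativity TLB, while a more associative TLB of the same total size holds them all:

```python
def tlb_misses(pages, num_entries, ways):
    """Count misses for a stream of virtual page numbers in an
    LRU set-associative TLB with num_entries total entries."""
    num_sets = num_entries // ways
    sets = [[] for _ in range(num_sets)]  # each set: LRU list, MRU last
    misses = 0
    for p in pages:
        s = sets[p % num_sets]
        if p in s:
            s.remove(p)        # hit: refresh to MRU position
        else:
            misses += 1
            if len(s) == ways:
                s.pop(0)       # evict the LRU entry
        s.append(p)
    return misses

# Illustrative stream: four pages that all map to the same set of a
# 64-entry 2-way TLB (stride = number of sets), touched round-robin.
stream = [i * 32 for i in range(4)] * 100
two_way = tlb_misses(stream, num_entries=64, ways=2)   # thrashes: 400 misses
full    = tlb_misses(stream, num_entries=64, ways=64)  # only 4 cold misses
```

Because only 4 of the 64 entries are ever usable by this pattern in the 2-way case, every access misses; raising associativity, not capacity, removes the conflicts, matching the abstract's conclusion.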
Characterizing Response Time of WWW Caching Proxy Servers
Cristina Duarte Murta {cristina@dcc.ufmg.br}, and Virgilio A. F. Almeida {virgilio@dcc.ufmg.br} (Computer Science Department, Federal University of Minas Gerais, Brazil)
Abstract: Caching proxies have an important role in the Web infrastructure. They save network traffic and reduce Web latency. While they have been widely deployed in the WWW, little is known about Web proxy behavior, and in particular about international proxies. This paper presents an analysis of caching proxy response times, based on logs from real proxies located in the USA, Europe, and South America. We found that high variability is an invariant in caching response times across log data of different proxies. We then show that this high variability can be explained through a combination of factors, such as the high variability in file sizes and in the bandwidth of the client links to the caching proxies. Finally, we discuss the implications of high variability in proxy behavior for performance characterization and modeling.
Characterizing the Behavior of Windows NT Web Server Workloads Using Processor Performance Counters
Ramesh Radhakrishnan {radhakri@ece.utexas.edu}, and Freeman L. Rawson* {frawson@us.ibm.com} (UT Austin, and IBM Austin Research Laboratory*)
Abstract: Our goal is to study the behavior of modern web servers and server application programs to understand how they interact with the underlying hardware and operating system (OS) environments. In our study we characterize the workload placed on both Pentium and Pentium Pro PCs running Windows NT Workstation 4.0 by three simple web serving scenarios, using the processor timestamp and performance counters. We used both the Pentium and the Pentium Pro to investigate the effect on the workloads of two processors that have the same instruction-set architecture but rather different microarchitectures. The workload shows a high percentage of branch instructions with only fair branch prediction on both processors. The numbers from the Pentium suggest a very low level of available parallelism at the instruction set architecture level, while the improvement in cycles per instruction (CPI) on the Pentium Pro indicates that there is more parallelism at the micro-operation level, even though the code makes somewhat inefficient use of the available resources.
Session 4: Measurement Methodology
Trace Sampling for Desktop Applications on Windows NT
Patrick J. Crowley {pcrowley@cs.washington.edu}, and Jean-Loup Baer {baer@cs.washington.edu} (Dept. of Computer Science and Engineering, University of Washington)
Abstract: This paper examines trace sampling for a suite of desktop application traces on Windows NT. This paper makes two contributions: we compare the accuracy of several sampling techniques to determine cache miss rates for these workloads, and we present a victim cache architecture study that demonstrates that sampling can be used to drive such studies. Of the sampling techniques used for the cache miss ratio determinations, stitch, which assumes that the state of the cache at the beginning of a sample is the same as the state at the end of the previous sample, is the most effective for these workloads. This technique is more accurate than the others and is reliable for caches up to 64KB in size.
Instruction-level Characterization of Scientific Computing Application using Hardware Performance Counters
Yong Luo {yongl@lanl.gov}, and Kirk W. Cameron {kirk@lanl.gov} (Scientific Computing Group, Los Alamos National Laboratory)
Abstract: Recently, advanced microprocessors have incorporated hardware performance counters in their designs, allowing for new types of analysis via empirical methods. The goal of this analysis continues to be the discovery of analytical/empirical methods to evaluate the performance of scaling codes on today's advanced CPUs and to predict the effects of architectural advances on current applications. In this paper, we provide an instruction-level characterization derived empirically, in an effort to demonstrate how architectural limitations in the underlying hardware will affect the performance of existing codes. In particular, we focus on scientific applications of interest to the DOE ASCI (Accelerated Strategic Computing Initiative) community. Preliminary results show promise in code characterization and empirical/analytical modeling, including the ability to quantify outstanding-miss utilization and stall time attributable to architectural limitations in the CPU and the memory hierarchy. This work further promises insight into quantifying bounds for CPI0, the ideal CPI with an infinite L1 cache. In general, if we can characterize workloads using parameters that are independent of architecture, as in this work, then we can more appropriately compare different architectures in an effort to direct processor/code development.
Last Updated: November 24, 1998