WORKSHOP ON WORKLOAD CHARACTERIZATION
Westin Park Central | Dallas, Texas | November 29, 1998
Held in Conjunction with Micro-31: The 31st Annual ACM/IEEE International Symposium on Microarchitecture
Advance Program
A book containing most of these papers has been published by the IEEE Computer Society under the title "Workload Characterization: Methodology and Case Studies", IEEE CS Press, 1999.
Session 1: Contemporary Workloads I: Java and Graphics (8am-9:30am)
Chair: Lizy Kurian John
"A Study of Code Reuse and Sharing Characteristics of Java Applications", Marie T. Conte, Andrew R. Trick, John C. Gyllenhaal, and Wen-mei W. Hwu (Computer and System Research Lab, University of Illinois at Urbana-Champaign) Abstract
"Characterization of Java Workloads by Principal Components Analysis and Indirect Branches", Kingsum Chow, Adam Wright, and Konrad Lai (Microcomputer Research Labs, Intel Corporation) Abstract Full Paper
"Generation of 3D Graphics Workload for System Performance Analysis", Ali Poursepanj, and David Christie (CPU and System Architecture Group, AMD Corporation) Abstract Full Paper
"Analysis of Pro/ENGINEER V19 Bench98 Benchmark on a Platform Based on the Pentium II Xeon Processor and the 82440GX AGPset", Raed Kanjo (Intel Corporation) Abstract Full Paper
"Parameter Value Characterization of Windows NT-based Applications", John Kalamatianos, Ronnie Chaiken*, and David Kaeli (Dept. of ECE, Northeastern University, and Microsoft Research*) Abstract Full Paper
Session 2: I/O and Memory (10am - 11am)
Chair: Ann Marie Grizzaffi Maynard
"Application Environments & I/O Workload Characterization for Today and Tomorrow", Todd Boyd, and Renato Recio (IBM Corporation) Abstract Full Paper
"Self-similarity in I/O workload: analysis and modeling", Maria E. Lopez, and Vicente Santonja (Departamento de Ingeniería de Sistemas, Computadores y Automática (DISCA), Universidad Politécnica de Valencia (UPV)) Abstract Full Paper
"Memory Access Pattern Analysis", Mary D. Brown, Roy M. Jenevein, and Nasr Ullah (System Performance and Modeling, Motorola Inc) Abstract Full Paper
"Characterizing Instruction Latency for Speculative Issue SMPs: A Case Study of Varying Memory System Performance on the SPLASH-2 Benchmarks", Brian Grayson, and Craig Chase (Dept. of ECE, UT Austin) Abstract Full Paper
11am Keynote: Yale Patt
Session 3: Contemporary Workloads II: Data Mining and Web Servers (1pm-2pm)
Chair: Pradip Bose
"Performance and Memory-Access Characterization of Data Mining Applications", Jeffrey P. Bradford, and Jose Fortes (Dept. of Electrical and Computer Engineering, Purdue University) Abstract Full Paper
"Memory Characterization of a Parallel Data Mining Workload", Jin-Soo Kim, Xiaohon Qin, and Yarsun Hsu (IBM T.J. Watson Research Center) Abstract Full Paper
"Characterizing Response Times of WWW Caching Proxy Servers", Cristina Duarte Murta, and Virgilio A. F. Almeida (Computer Science Department, Federal University of Minas Gerais, Brazil) Abstract Full Paper
"Characterizing the Behavior of Windows NT Web Server Workloads Using Processor Performance Counters", Ramesh Radhakrishnan, and Freeman L. Rawson* (UT Austin, and IBM Austin Research Laboratory*) Abstract Full Paper
Session 4: Measurement Methodology (2pm-2:30pm)
Chair: Pradip Bose
"Trace Sampling for Desktop Applications on Windows NT", Patrick J. Crowley, and Jean-Loup Baer (Dept. of Computer Science and Engineering, University of Washington) Abstract Full Paper
"Instruction-level Characterization of Scientific Computing Application using Hardware Performance Counters", Yong Luo, and Kirk W. Cameron (Scientific Computing Group, Los Alamos National Laboratory) Abstract Full Paper
Abstracts
Session 1: Contemporary Workloads I: Java and Graphics
A Study of Code Reuse and Sharing Characteristics of Java Applications
Marie T. Conte {mconte@crhc.uiuc.edu}, Andrew R. Trick {atrick@crhc.uiuc.edu}, John C. Gyllenhaal {gyllen@crhc.uiuc.edu}, and Wen-mei W. Hwu {hwu@crhc.uiuc.edu} (Computer and System Research Lab, University of Illinois at Urbana-Champaign)
Abstract: This paper presents a detailed characterization of Java application and applet workloads in terms of reuse and sharing of Java code at the program, class, and method level. In order to expose more sharing opportunities, techniques for detecting code equivalence even in the presence of minor code changes or constant pool index differences are also proposed and examined. The analyzed application workload consists of the recently released SPECjvm98 benchmarks and the applet workload is derived from three extensive searches of the Internet between May 1997 and May 1998 using an enhanced web crawler. Analysis of these workloads reveals several new code sharing and optimization opportunities.
Characterization of Java Workloads by Principal Components Analysis and Indirect Branches
Kingsum Chow {kingsum.chow@intel.com}, Adam Wright, and Konrad Lai (Microcomputer Research Labs, Intel Corporation)
Abstract: This paper compares emerging Java workloads (e.g. VolanoMark, SysmarkJ, SpecJVM98, and Jmark 2) with non-Java workloads (e.g. FSPEC95, ISPEC95/98, and Sysmark32/98) through the use of various multivariate data analysis techniques on data collected from about one thousand traces on Pentium(R) Pro systems. Among the counters measured, the most significant difference between Java and non-Java workloads is the density of indirect branches. Upon closer inspection, it was determined that the branching behavior of most Java workloads is no worse than that of a few poorly behaved ISPEC95 benchmarks such as gcc and perl. This paper shows the effectiveness of Principal Components Analysis in screening and categorizing workload statistics, as well as some interesting patterns in the indirect branches of Java workloads.
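The screening technique named in this abstract can be sketched in a few lines; the counter names and values below are purely illustrative and not from the paper's dataset. Principal components analysis standardizes each hardware-counter statistic and projects each trace onto the directions of greatest variance, so traces with similar counter profiles cluster together:

```python
import numpy as np

def pca(X, k=2):
    """Project rows of X (one workload trace per row, one hardware
    counter statistic per column) onto the top-k principal components."""
    # Standardize each counter so no single metric dominates the variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # SVD of the standardized data yields the principal directions.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:k].T               # coordinates of each trace
    explained = (S ** 2) / np.sum(S ** 2)  # variance explained per component
    return scores, explained[:k]

# Illustrative data: 6 traces x 3 counters (IPC, branch density,
# indirect-branch density); the last 3 rows mimic Java-like traces
# with a higher density of indirect branches.
X = np.array([[0.90, 0.15, 0.010],
              [0.80, 0.14, 0.012],
              [0.85, 0.16, 0.011],
              [0.60, 0.18, 0.040],
              [0.55, 0.19, 0.045],
              [0.65, 0.17, 0.042]])
scores, explained = pca(X, k=2)
```

On data like this, the first component separates the two groups, which is the screening effect the abstract describes: one or two components summarize most of the variance across hundreds of raw counters.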
Generation of 3D Graphics Workload for System Performance Analysis
Ali Poursepanj {ali.poursepanj@amd.com}, and David Christie {david.christie@amd.com} (CPU and System Architecture Group, AMD Corporation)
Abstract: Generation of representative workloads for system performance models has been a challenge for PC system architects who are using trace driven models. Unlike processor performance models that typically only use a single CPU instruction trace, system models in most cases require traces of CPU, Advanced Graphics Port (AGP), PCI, and other bus mastering devices that can access memory. A common approach is to collect bus traces with a logic analyzer. Although this allows generation of realistic traces, typical analyzer buffer sizes seriously limit the length of contiguous traces. Another problem is that traces collected in a specific system configuration may not be representative of other systems, especially future systems with different timings and/or bus protocols. This paper presents an overview of an approach that can be used to generate long bus traces for performance model stimulus. We describe methods for characterization of system behavior and generation of accurate synthetic graphics traces based on real traces, and give examples of correlated CPU and AGP traces that are synthetic but reflect the characteristics of real CPU/AGP traces.
Analysis of Pro/ENGINEER V19 Bench98 Benchmark on a Platform Based on the Pentium(R) II Xeon(TM) Processor and the 82440GX AGPset
Raed Kanjo {raed.kanjo@intel.com} (Intel Corporation)
Abstract: This paper characterizes the behavior of the latest Pro/ENGINEER V19 benchmark, Bench98(TM), on an IA-32 workstation platform based on the recently-introduced Pentium(R) II Xeon(TM) processor and 82440GX AGPset. The paper investigates the sensitivity of the benchmark to the size and/or speed of various platform components and analyzes the floating point, branching, caching, and memory behavior of the benchmark, comparing the results to the measurements obtained using SPEC95.
Parameter Value Characterization of Windows NT-based Applications
John Kalamatianos, Ronnie Chaiken*, and David Kaeli {kaeli@ece.neu.edu} (Dept. of ECE, Northeastern University, and Microsoft Research*)
Abstract: Compiler optimizations such as code specialization and partial evaluation can effectively exploit identifiable invariance in variable values. Value profiling can identify invariant variables that the compiler misses at compile time. In this paper we focus on the invariance of procedure parameters for a set of desktop applications run on MS Windows NT 4.0. Most of these applications are non-scientific and execute interactively through a rich GUI. Given the dynamic nature of this workload, one would expect parameter values to exhibit equally unpredictable behavior. Our work addresses this question by measuring the invariance and temporal locality of parameter values. We also measure the invariance of parameter values for four benchmarks from the SPECINT95 suite for comparison.
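A common invariance metric in value profiling, which may be close to what is measured here, is the fraction of dynamic calls on which a parameter holds its single most frequent value. The sketch below is a minimal illustration under that assumption; the procedure and values are hypothetical:

```python
from collections import Counter

def invariance(values):
    """Invariance of one procedure parameter: the fraction of dynamic
    calls on which the parameter held its single most common value."""
    counts = Counter(values)
    return max(counts.values()) / len(values)

# Hypothetical value profile: the observed value of one parameter
# across 10 dynamic calls to some procedure.
profile = [4096, 4096, 4096, 4096, 4096, 4096, 4096, 4096, 512, 64]
inv = invariance(profile)  # 0.8 -> a strong candidate for specialization
```

A parameter with invariance near 1.0 is a candidate for code specialization: the compiler can emit a fast path for the dominant value and fall back to general code otherwise.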
Session 2: I/O and Memory
Application Environments & I/O Workload Characterization for Today and Tomorrow
Todd Boyd {wtboyd@us.ibm.com}, and Renato Recio {recio@us.ibm.com} (IBM Corporation)
Abstract: The design and development of future I/O subsystems need to keep pace with the rapid rate of improvement in microprocessor technology and changes in system structure. In order to analyze the potential bottlenecks of I/O subsystems we must first identify and characterize the various workloads that will run on these future systems. This paper has two major goals. The first is to identify and analyze the application environments that are presently being implemented throughout the computing industry. The second is to identify and summarize the I/O subsystem characteristics of various present-day and future workloads that typify these application environments.
Self-similarity in I/O workload: analysis and modeling
Maria E. Lopez {megomez@disca.upv.es}, and Vicente Santonja {visan@disca.upv.es} (Departamento de Ingeniería de Sistemas, Computadores y Automática (DISCA), Universidad Politécnica de Valencia (UPV))
Abstract: Recently the notion of self-similarity has been applied to wide-area and local-area network traffic. This paper demonstrates that disk-level I/O requests are self-similar in nature. We show visual and mathematical evidence that I/O accesses are consistent with self-similarity. Moreover, we show that this property of I/O accesses is mainly due to writes. For our experiments, we use two sets of traces that collect the disk activity of two systems over a period of two months. Such behavior has serious implications for the performance evaluation of storage subsystem designs and implementations, since commonly-used simplifying assumptions about workload characteristics (e.g., Poisson arrivals) are shown to be incorrect. Using the ON/OFF model, we implement a disk request generator whose inputs are the measured properties of the available trace data. We analyze the synthesized workload and confirm that it exhibits the correct self-similar behavior.
Memory Access Pattern Analysis
Mary D. Brown {mdb@umich.edu}, Roy M. Jenevein {jenevein@ibmoto.com}, and Nasr Ullah {nasr_ullah@email.sps.mot.com} (System Performance and Modeling, Motorola Inc)
Abstract: A methodology for analyzing memory behavior has been developed for the purpose of evaluating memory system design. MPAT, a memory pattern analysis tool, has been used to profile memory transactions of dynamic instruction traces. First, the memory model and means of gathering performance metrics are discussed. Then the metrics are evaluated in order to measure the utilization of the memory system and determine what changes should be made to improve memory system performance.
Characterizing Instruction Latency for Speculative Issue SMPs: A Case Study of Varying Memory System Performance on the SPLASH-2 Benchmarks
Brian Grayson {bgrayson@ece.utexas.edu}, and Craig Chase (Dept. of ECE, UT Austin)
Abstract: Out-of-order, speculative, superscalar processors are complex. The behavior of multiprocessor systems that use such processors is not well understood and very difficult to predict. We tackle this problem using a powerful simulator, Armadillo, and a novel characterization framework that breaks the instruction pipeline into five meta-stages. The Armadillo simulator models symmetric multiprocessors (SMPs) constructed from highly aggressive superscalar processors on a shared bus, and provides accurate, detailed statistics on numerous aspects of the simulated system, including the amount of time each instruction spends in each of the five meta-stages. We also analyze the fraction of each instruction's lifetime during which it remains speculative and the amount of time an instruction spends on the critical path. To demonstrate the effectiveness of this approach, we apply the characterization to applications from the SPLASH-2 benchmark suite and evaluate their sensitivity to key memory system parameters: bus frequency, bus width, memory latency, and cache latency.
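The accounting behind a meta-stage breakdown like the one described here can be sketched as follows. The five stage names below are placeholders, not the paper's meta-stages, and the timestamps are invented; the point is only the bookkeeping: record a cycle timestamp at each stage boundary per instruction, then average the per-stage deltas over the trace.

```python
# Placeholder stage names; the paper defines its own five meta-stages.
STAGES = ["fetch", "decode", "issue", "execute", "retire"]

def decompose(timestamps):
    """Given one instruction's cycle timestamps at each meta-stage
    boundary (len == len(STAGES) + 1), return cycles spent per stage."""
    return {s: t1 - t0
            for s, t0, t1 in zip(STAGES, timestamps, timestamps[1:])}

def average_per_stage(instructions):
    """Average cycles per meta-stage over a set of instructions."""
    totals = {s: 0 for s in STAGES}
    for ts in instructions:
        for s, c in decompose(ts).items():
            totals[s] += c
    n = len(instructions)
    return {s: totals[s] / n for s in STAGES}

# Two illustrative instructions; the second stalls in "issue".
insns = [[0, 1, 2, 4, 7, 8],
         [1, 2, 3, 9, 12, 13]]
avg = average_per_stage(insns)  # avg["issue"] == 4.0
```

Aggregating this way shows which meta-stage dominates instruction lifetime as memory system parameters vary, which is the kind of sensitivity question the abstract poses.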
Session 3: Contemporary Workloads II: Data Mining and Web Servers
Performance and Memory-Access Characterization of Data Mining Applications
Jeffrey P. Bradford {jeffrey.bradford@computer.org}, and Jose Fortes {fortes@ecn.purdue.edu} (Dept. of Electrical and Computer Engineering, Purdue University)
Abstract: This paper characterizes the performance and memory-access behavior of a decision tree induction program, a previously unstudied application used in data mining and knowledge discovery in databases. Performance is studied via RSIM, an execution driven simulator, for three uniprocessor models that exploit Instruction Level Parallelism (ILP) to varying degrees. Several properties of the program are noted. Out-of-order dispatch and multiple issue provide a significant performance advantage: 50%-250% improvement in IPC for out-of-order versus in-order, and 5%-120% improvement in IPC for four-way issue versus single issue. Multiple issue provides a greater performance improvement for larger L2 cache sizes, when the program is limited by CPU performance; out-of-order dispatch provides a greater performance improvement for smaller L2 cache sizes. The program has a very small instruction footprint: an 8-kB L1 instruction cache is sufficient to bring the instruction miss rate below 0.1. A small (8 kB) L1 data cache is sufficient to capture most of the locality, resulting in L1 miss rates between 10%-20%. Increasing the size of the L2 data cache does not significantly improve performance until a significant fraction (over 1/4) of the dataset fits into the L2 cache. In addition, a procedure is developed for scaling the cache sizes when using scaled-down datasets, allowing the results for smaller datasets to be used to predict the performance of full-sized datasets.
Memory Characterization of a Parallel Data Mining Workload
Jin-Soo Kim {jinsoo@watson.ibm.com}, Xiaohon Qin {xqin@watson.ibm.com}, and Yarsun Hsu {hsu@watson.ibm.com} (IBM T.J. Watson Research Center)
Abstract: This paper studies a representative of an important class of emerging applications, a parallel data mining workload. The application, extracted from the IBM Intelligent Miner, identifies groups of records that are mathematically similar based on a neural network model called the self-organizing map. We examine and compare in detail two implementations of the application along four dimensions: (1) temporal locality, or working set sizes; (2) spatial locality and memory block utilization; (3) communication characteristics and scalability; and (4) TLB performance. First, we find that the working set hierarchy of the application is governed by two parameters, namely the size of an input record and the size of the prototype array; it is independent of the number of input records. Second, the application shows good spatial locality, with the implementation optimized for sparse data sets having slightly worse spatial locality. Third, due to the batch update scheme, the application generates very little communication. Finally, a 2-way set-associative TLB may result in severely skewed TLB performance in a multiprocessor environment, caused by a large discrepancy in the number of conflict misses. Increasing the set associativity is more effective in mitigating the problem than increasing the TLB size.
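The associativity effect in the last abstract can be illustrated with a toy TLB model; the sizes and access pattern below are illustrative, not the paper's configuration. Pages whose numbers collide in the same set evict each other in a low-associativity TLB, while a more associative TLB of the same total size holds them all:

```python
def tlb_misses(pages, num_entries, ways):
    """Count misses for a stream of virtual page numbers in an
    LRU set-associative TLB with num_entries total entries."""
    num_sets = num_entries // ways
    sets = [[] for _ in range(num_sets)]  # each set: LRU list, MRU last
    misses = 0
    for p in pages:
        s = sets[p % num_sets]
        if p in s:
            s.remove(p)        # hit: refresh to MRU position
        else:
            misses += 1
            if len(s) == ways:
                s.pop(0)       # evict the LRU entry
        s.append(p)
    return misses

# Illustrative stream: four pages that all map to the same set of a
# 64-entry 2-way TLB (stride = number of sets), touched round-robin.
stream = [i * 32 for i in range(4)] * 100
two_way = tlb_misses(stream, num_entries=64, ways=2)   # thrashes: 400 misses
full    = tlb_misses(stream, num_entries=64, ways=64)  # only 4 cold misses
```

Because only 4 of the 64 entries are ever usable by this pattern in the 2-way case, every access misses; raising associativity, not capacity, removes the conflicts, matching the abstract's conclusion.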
Characterizing Response Time of WWW Caching Proxy Servers
Cristina Duarte Murta {cristina@dcc.ufmg.br}, and Virgilio A. F. Almeida {virgilio@dcc.ufmg.br} (Computer Science Department, Federal University of Minas Gerais, Brazil)
Abstract: Caching proxies have an important role in the Web infrastructure. They save network traffic and reduce Web latency. While they have been widely deployed in the WWW, little is known about Web proxy behavior, and in particular about international proxies. This paper presents an analysis of caching proxy response times, based on logs from real proxies located in the USA, Europe, and South America. We found that high variability is an invariant in caching response times across log data of different proxies. We then show that this high variability can be explained through a combination of factors, such as the high variability in file sizes and in the bandwidth of the client links to the caching proxies. Finally, we discuss the implications of high variability in proxy behavior for performance characterization and modeling.
Characterizing the Behavior of Windows NT Web Server Workloads Using Processor Performance Counters
Ramesh Radhakrishnan {radhakri@ece.utexas.edu}, and Freeman L. Rawson* {frawson@us.ibm.com} (UT Austin, and IBM Austin Research Laboratory*)
Abstract: Our goal is to study the behavior of modern web servers and server application programs to understand how they interact with the underlying hardware and operating system (OS) environments. In our study we characterize the workload placed on both Pentium and Pentium Pro PCs running Windows NT Workstation 4.0 by three simple web serving scenarios, using the processor timestamp and performance counters. We used both the Pentium and the Pentium Pro to investigate the effect on the workloads of two processors that have the same instruction-set architecture but rather different microarchitectures. The workload shows a high percentage of branch instructions with only fair branch prediction on both processors. The numbers from the Pentium suggest a very low level of available parallelism at the instruction set architecture level, while the improvement in cycles per instruction (CPI) on the Pentium Pro indicates that there is more parallelism at the micro-operation level, even though the code makes somewhat inefficient use of the available resources.
Session 4: Measurement Methodology
Trace Sampling for Desktop Applications on Windows NT
Patrick J. Crowley {pcrowley@cs.washington.edu}, and Jean-Loup Baer {baer@cs.washington.edu} (Dept. of Computer Science and Engineering, University of Washington)
Abstract: This paper examines trace sampling for a suite of desktop application traces on Windows NT. This paper makes two contributions: we compare the accuracy of several sampling techniques to determine cache miss rates for these workloads, and we present a victim cache architecture study that demonstrates that sampling can be used to drive such studies. Of the sampling techniques used for the cache miss ratio determinations, stitch, which assumes that the state of the cache at the beginning of a sample is the same as the state at the end of the previous sample, is the most effective for these workloads. This technique is more accurate than the others and is reliable for caches up to 64KB in size.
Instruction-level Characterization of Scientific Computing Application using Hardware Performance Counters
Yong Luo {yongl@lanl.gov}, and Kirk W. Cameron {kirk@lanl.gov} (Scientific Computing Group, Los Alamos National Laboratory)
Abstract: Recently, advanced microprocessors have incorporated hardware performance counters in their designs, allowing for new types of analysis via empirical methods. The goal of this analysis continues to be the discovery of analytical/empirical methods to evaluate the performance of scaling codes on today's advanced CPUs and to predict the effects of architectural advances on current applications. In this paper, we provide an instruction-level characterization derived empirically, in an effort to demonstrate how architectural limitations in the underlying hardware will affect the performance of existing codes. In particular, we focus on scientific applications of interest to the DOE ASCI (Accelerated Strategic Computing Initiative) community. Preliminary results show promise in code characterization and empirical/analytical modeling, including the ability to quantify outstanding-miss utilization and stall time attributable to architectural limitations in the CPU and the memory hierarchy. This work further promises insight into quantifying bounds for CPI0, the ideal CPI with an infinite L1 cache. In general, if we can characterize workloads using parameters that are independent of architecture, as in this work, then we can more appropriately compare different architectures in an effort to direct processor/code development.
Last Updated: November 24, 1998