BU Open Access Articles

Recent Submissions

Now showing 1 - 20 of 5290
  • Item
    The China Historical Christian Database: A Dataset Quantifying Christianity in China from 1550 to 1950
    (MDPI AG, 2024-06) Mayfield, Alex; Frei, Margaret; Ireland, Daryl; Menegon, Eugenio
    The era of digitization is revolutionizing traditional humanities research, presenting both novel methodologies and challenges. This field harnesses quantitative techniques to yield groundbreaking insights, contingent upon comprehensive datasets on historical subjects. The China Historical Christian Database (CHCD) exemplifies this trend, furnishing researchers with a rich repository of historical, relational, and geographical data about Christianity in China from 1550 to 1950. The study of Christianity in China confronts formidable obstacles, including the mobility of historical agents, fluctuating relational networks, and linguistic disparities among scattered sources. The CHCD addresses these challenges by curating an open-access database built in Neo4j that records information about Christian institutions in China and the people who worked inside them. Drawing on historical sources, the CHCD contains temporal, relational, and geographic data. The database currently has over 40,000 nodes and 200,000 relationships, and continues to grow. Beyond its utility for religious studies, the CHCD supports broader interdisciplinary inquiries, including social network analysis, geospatial visualization, and economic modeling. This article introduces the CHCD’s structure and explains the data collection and curation process.
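    The abstract describes a graph database of people, institutions, and their relationships stored in Neo4j. Below is a minimal sketch of how such a graph could be queried from Python with the official Neo4j driver; the node labels, relationship type, and property names are illustrative assumptions, not the CHCD's actual schema, and the connection details are placeholders.

    ```python
    # Minimal sketch of querying a Neo4j graph like the CHCD from Python.
    # Labels (Person, Institution), the AFFILIATED_WITH relationship, and
    # property names are hypothetical; connection details are placeholders.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    query = """
    MATCH (p:Person)-[:AFFILIATED_WITH]->(i:Institution)
    WHERE i.city = $city
    RETURN p.name AS person, i.name AS institution
    LIMIT 25
    """

    with driver.session() as session:
        for record in session.run(query, city="Beijing"):
            print(record["person"], "-", record["institution"])

    driver.close()
    ```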
  • Item
    Boston real estate residential rental trends
    (Kaggle, 2024-10-23) Stoller, Gregory
    Project for a doctoral business class (DBA 746 Artificial Intelligence Applications with Dr. Hunter Monroe) combining Python coding with industry research
  • Item
    Montevideo convention and CGTN: defining statehood for global outreach
    (USC Annenberg Press, 2023) Chen, Alex Keyu; Massignan, Virginia; Yachin, Mor; Winkler, Carol K.; Lokmanoglu, Ayse
  • Item
    ISIS media and troop withdrawal announcements: visualizing community and resilience
    (USC Annenberg Press, 2022) Lokmanoglu, Ayse; Winkler, Carol; McMinimy, Kayla; Almahmoud, Munira
  • Item
    Changing preferences: an experiment and estimation of market-incentive effects on altruism
    (Elsevier BV, 2023-12) Byambadalai, Undral; Ma, Ching-To Albert; Wiesen, Daniel
    This paper studies how altruistic preferences are changed by markets and incentives. We conduct a laboratory experiment with a within-subject design. Subjects are asked to choose health care qualities for hypothetical patients in monopoly, duopoly, and quadropoly. Prices, costs, and patient benefits are experimental incentive parameters. In monopoly, subjects choose quality by trading off between profits and altruistic patient benefits. In duopoly and quadropoly, subjects play a simultaneous-move game. Uncertain about an opponent's altruism, each subject competes for patients by choosing qualities. Bayes-Nash equilibria describe subjects' quality decisions as functions of altruism. Using a nonparametric method, we estimate the population altruism distributions from Bayes-Nash equilibrium qualities in different markets and incentive configurations. Competition tends to reduce altruism, but duopoly and quadropoly equilibrium qualities are much higher than in monopoly. Although markets crowd out altruism, the disciplinary powers of market competition are stronger. Counterfactuals confirm that markets change preferences.
  • Item
    Shot noise-mitigated secondary electron imaging with ion count-aided microscopy
    (Proceedings of the National Academy of Sciences, 2024-07-30) Agarwal, Akshay; Kasaei, Leila; He, Xinglin; Kitichotkul, Ruangrawee; Hitit, Oğuz Kağan; Peng, Minxu; Schultz, J. Albert; Feldman, Leonard C.; Goyal, Vivek K.
    Modern science is dependent on imaging on the nanoscale, often achieved through processes that detect secondary electrons created by a highly focused incident charged particle beam. Multiple types of measurement noise limit the ultimate trade-off between the image quality and the incident particle dose, which can preclude useful imaging of dose-sensitive samples. Existing methods to improve image quality do not fundamentally mitigate the noise sources. Furthermore, barriers to assigning a physically meaningful scale make the images qualitative. Here, we introduce ion count-aided microscopy (ICAM), which is a quantitative imaging technique that uses statistically principled estimation of the secondary electron yield. With a readily implemented change in data collection, ICAM substantially reduces source shot noise. In helium ion microscopy, we demonstrate a 3× dose reduction and a good match between these empirical results and theoretical performance predictions. ICAM facilitates imaging of fragile samples and may make imaging with heavier particles more attractive.
  • Item
    The particle of Haag's local quantum physics: a critical assessment
    (MDPI AG, 2024-09-01) Jaeger, Gregg
    Rudolf Haag's Local Quantum Physics (LQP) is an alternative framework to conventional relativistic quantum field theory for combining special relativity and quantum theory based on first principles, making it of great interest for the purposes of conceptual analysis despite currently being relatively limited as a tool for making experimental predictions. In LQP, the elementary particles are defined as species of causal link between interaction events, together with which they comprise its most fundamental entities. This notion of particle has yet to be independently assessed as such. Here, it is captured via a set of propositions specifying particle characteristics and then compared to previous particle notions. Haag's particle differs decisively with respect to mechanical intuitions about particles by lacking, among other things, even an approximate independent space-time location. This notion is thus found to differ greatly even from those of relativistic quantum mechanics and quantum field theory, which have been applied to the known elementary particles.
  • Item
    On the Chinese Character 物 (Object)
    (The Commercial Press, Beijing China, 2024-07-01) Huang, Weijia; Pan, Fengfan
  • Item
    KVBench: a key-value benchmarking suite
    (ACM, 2024-06-09) Zhu, Zichen; Athanassoulis, Manos
    Key-value stores are at the core of several modern NoSQL-based data systems, and thus, a comprehensive benchmark tool is of paramount importance in evaluating their performance under different workloads. Prior research reveals that real-world workloads have a diverse range of characteristics, such as the fraction of point queries that target non-existing keys, point and range deletes, as well as different distributions for queries and updates, all of which have very different performance implications. State-of-the-art key-value workload generators, such as YCSB and db_bench, fail to generate workloads that emulate these practical workloads, limiting the dimensions on which we can benchmark the systems' performance. In this paper, we present KVBench, a novel synthetic workload generator that fills the gap between classical key-value workload generators and more complex real-life workloads. KVBench supports a wide range of operations, including point queries, range queries, inserts, updates, deletes, and range deletes; among these, inserts, queries, and updates can be customized with different distributions. Compared to state-of-the-art key-value workload generators, KVBench offers a richer array of knobs, including the proportion of empty point queries, customized distributions for updates and queries, and range deletes with specific selectivity, constituting a highly flexible framework that can better emulate real-world workloads.
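    To make the kinds of knobs described above concrete, here is a toy key-value workload generator with an operation mix, a fraction of point queries on non-existing keys, a skewed access distribution, and range deletes with a rough target selectivity. All parameter names and defaults are illustrative assumptions; this is not KVBench's actual interface.

    ```python
    # Toy key-value workload generator; NOT KVBench's actual API.
    import random

    def generate_workload(n_ops, insert_frac=0.5, point_query_frac=0.3,
                          range_delete_frac=0.05, empty_point_query_frac=0.2,
                          key_space=1_000_000, skew=1.2):
        existing, ops = [], []
        for _ in range(n_ops):
            r = random.random()
            if r < insert_frac or not existing:
                key = random.randrange(key_space)
                existing.append(key)
                ops.append(("insert", key))
            elif r < insert_frac + point_query_frac:
                if random.random() < empty_point_query_frac:
                    # Key outside the insert range, so the lookup always misses.
                    ops.append(("get", key_space + random.randrange(key_space)))
                else:
                    # Skewed access over existing keys (Pareto-distributed rank).
                    idx = min(int(random.paretovariate(skew)), len(existing)) - 1
                    ops.append(("get", existing[idx]))
            elif r < insert_frac + point_query_frac + range_delete_frac:
                start = random.randrange(key_space)
                # Range covering roughly 1% of the key space.
                ops.append(("range_delete", start, start + key_space // 100))
            else:
                ops.append(("update", random.choice(existing)))
        return ops

    print(generate_workload(10))
    ```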
  • Item
    In situ neighborhood sampling for large-scale GNN training
    (ACM, 2024-06-09) Song, Yuhang; Chen, Po Hao; Lu, Yuchen; Abrar, Naima; Kalavri, Vasiliki
    Graph Neural Network (GNN) training algorithms commonly perform neighborhood sampling to construct fixed-size mini-batches for weight aggregation on GPUs. State-of-the-art disk-based GNN frameworks compute sampling on the CPU, transferring edge partitions from disk to memory for every mini-batch. We argue that this design incurs significant waste of PCIe bandwidth, as entire neighborhoods are transferred to main memory only to be discarded after sampling. In this paper, we take the first step toward an inherently different approach that harnesses near-storage compute technology to achieve efficient large-scale GNN training. We target a single machine with one or more SmartSSD devices and develop a high-throughput, epoch-wide sampling FPGA kernel that enables pipelining across epochs. When compared to a baseline random-access sampling kernel, our solution achieves up to 4.26× lower sampling time per epoch.
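    The sampling step the abstract refers to can be illustrated with a minimal CPU-side sketch: for each seed node in a mini-batch, draw a fixed-size sample of neighbors per hop. This is a plain-Python illustration under assumed inputs (an in-memory adjacency dict), not the paper's near-storage FPGA kernel.

    ```python
    # Minimal multi-hop neighborhood sampler for a mini-batch of seed nodes.
    import random

    def sample_neighborhoods(adj, seeds, fanouts):
        """adj: dict node -> list of neighbors; fanouts: neighbors to keep per hop."""
        layers = [list(seeds)]
        frontier = set(seeds)
        for fanout in fanouts:
            next_frontier = set()
            for node in frontier:
                neighbors = adj.get(node, [])
                k = min(fanout, len(neighbors))
                next_frontier.update(random.sample(neighbors, k))
            layers.append(sorted(next_frontier))
            frontier = next_frontier
        return layers

    adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
    print(sample_neighborhoods(adj, seeds=[0], fanouts=[2, 2]))
    ```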
  • Item
    Benchmarking learned and LSM indexes for data sortedness
    (ACM, 2024-06-09) Raman, Aneesh; Huynh, Andy; Lu, Jinqi; Athanassoulis, Manos
    Database systems use indexes on frequently accessed attributes to accelerate query and transaction processing. This requires paying the cost of maintaining and updating those indexes, which can be thought of as the process of adding structure (e.g., sort) to an otherwise unstructured data collection. The indexing cost is redundant when data arrives pre-sorted, even if only up to some degree. While recent work has studied how classical indexes like B+-trees cannot fully exploit near-sortedness during ingestion, such an exploration is lacking for other index designs, like read-optimized learned indexes or write-optimized LSM-trees. In this paper, we bridge this gap by conducting the first-ever study on the behavior of learned indexes and LSM-trees when varying the data sortedness in an ingestion workload. Specifically, we build on prior work on benchmarking data sortedness on B+-trees and expand the scope to benchmark: (i) ALEX and LIPP, which are updatable learned index designs; and (ii) the LSM-tree engine offered by RocksDB. We present our evaluation framework and detail key insights on the performance of the index designs when varying data sortedness. Our observations indicate that learned indexes exhibit unpredictable performance when ingesting differently sorted data, while LSM-trees can benefit from sortedness-aware optimizations. We highlight the potential headroom for improvement and lay the groundwork for further research in this domain.
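    One simple way to think about "varying data sortedness in an ingestion workload" is to start from a fully sorted key sequence and displace a tunable fraction of entries by a bounded distance. The sketch below illustrates that idea only; it is an assumption for exposition, not the paper's actual workload generator or its sortedness metric.

    ```python
    # Generate an ingestion stream with tunable sortedness by displacing
    # a fraction of keys from their sorted positions.
    import random

    def nearly_sorted_keys(n, frac_out_of_place=0.1, max_displacement=100):
        keys = list(range(n))
        for _ in range(int(n * frac_out_of_place)):
            i = random.randrange(n)
            j = min(n - 1, max(0, i + random.randint(-max_displacement, max_displacement)))
            keys[i], keys[j] = keys[j], keys[i]
        return keys

    workload = nearly_sorted_keys(1_000, frac_out_of_place=0.05, max_displacement=50)
    print(workload[:20])
    ```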
  • Item
    SmartFuse: reconfigurable smart switches to accelerate fused collectives in HPC applications
    (ACM, 2024-06-03) Guo, Anqi; Herbordt, Martin Christopher
    Communication switches have sometimes been augmented to process collectives, e.g., in the IBM BlueGene and Mellanox SHArP switches. In this work, we find that there is a great acceleration opportunity through the further augmentation of switches to accelerate more complex functions that combine communication with computation. We consider three types of such functions. The first is fully-fused collectives built by fusing multiple existing collectives like Allreduce with Alltoall. The second is semi-fused collectives built by combining a collective with another computation. The third is higher-order collectives built by combining multiple computations and communications, such as performing a matrix-matrix multiply (PGEMM). In this work, we propose a framework called SmartFuse to accelerate fused collective functions. The core of SmartFuse is a reconfigurable smart switch to support these operations. The semi/fully fused collectives are implemented with a CGRA-like architecture, while higher-order collectives are implemented with a more specialized computational unit that can also schedule communication. Supporting our framework is software to evaluate and translate relevant parts of the input program, compile them into a control data flow graph, and then map this graph to the switch hardware. The proposed framework, once deployed, has the strong potential to accelerate existing HPC applications transparently by encapsulation within an MPI implementation. Experimental results show that this approach improves the performance of the PGEMM kernel, miniFE, and AMG by, on average, 94%, 15%, and 13%, respectively.
  • Item
    CAVE: concurrency-aware graph processing on SSDs
    (ACM, 2024-05-30) Papon, Tarikul Islam; Chen, Taishan; Athanassoulis, Manos
    Large-scale graph analytics has become increasingly common in areas like social networks, physical sciences, transportation networks, and recommendation systems. Since many such practical graphs do not fit in main memory, graph analytics performance depends on efficiently utilizing underlying storage devices. These out-of-core graph processing systems employ sharding and sub-graph partitioning to optimize for storage while relying on efficient sequential access of traditional hard disks. However, today's storage is increasingly based on solid-state drives (SSDs) that exhibit high internal parallelism and efficient random accesses. Yet, state-of-the-art graph processing systems do not explicitly exploit those properties, resulting in subpar performance. In this paper, we develop CAVE, the first graph processing engine that optimally exploits underlying SSD-based storage by harnessing the available storage device parallelism and carefully selecting which I/Os to graph data can be issued concurrently. Thus, CAVE traverses multiple paths and processes multiple nodes and edges concurrently, achieving parallelization at a granular level. We identify two key ways to parallelize graph traversal algorithms based on the graph structure and algorithm: intra-subgraph and inter-subgraph parallelization. The former identifies subgraphs that contain vertices that can be accessed in parallel, while the latter identifies subgraphs that can be processed in their entirety in parallel. To showcase the benefit of our approach, we build within CAVE parallelized versions of five popular graph algorithms (Breadth-First Search, Depth-First Search, Weakly Connected Components, PageRank, Random Walk) that exploit the full bandwidth of the underlying device. CAVE uses a blocked file format based on adjacency lists and employs a concurrent cache pool that is essential to the parallelization of graph algorithms. By experimenting with different types of graphs on three SSD devices, we demonstrate that CAVE utilizes the available parallelism and scales to diverse real-world graph datasets. CAVE achieves up to one order of magnitude speedup compared to the popular out-of-core systems Mosaic and GridGraph, and up to three orders of magnitude speedup in runtime compared to GraphChi.
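    The core idea of issuing many graph I/Os concurrently can be sketched with a BFS whose adjacency-list fetches for the whole frontier are submitted to a thread pool at once (mirroring how an SSD can serve many outstanding reads in parallel). The fetch function below is a stand-in for reading a node's adjacency list from an on-disk blocked format; it is an illustrative assumption, not CAVE's actual engine.

    ```python
    # BFS that issues all adjacency-list reads for the current frontier
    # concurrently, then expands the next frontier from the results.
    from concurrent.futures import ThreadPoolExecutor

    def bfs_concurrent(fetch_neighbors, source, max_workers=16):
        visited = {source}
        frontier = [source]
        order = [source]
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            while frontier:
                # Submit all reads for this frontier at once.
                neighbor_lists = list(pool.map(fetch_neighbors, frontier))
                next_frontier = []
                for neighbors in neighbor_lists:
                    for v in neighbors:
                        if v not in visited:
                            visited.add(v)
                            next_frontier.append(v)
                order.extend(next_frontier)
                frontier = next_frontier
        return order

    adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
    print(bfs_concurrent(lambda u: adj[u], source=0))
    ```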
  • Item
    TikTok and the art of personalization: investigating exploration and exploitation on social media feeds
    (ACM, 2024-05-13) Vombatkere, Karan
    Recommendation algorithms for social media feeds often function as black boxes from the perspective of users. We aim to detect whether social media feed recommendations are personalized to users, and to characterize the factors contributing to personalization in these feeds. We introduce a general framework to examine a set of social media feed recommendations for a user as a timeline. We label items in the timeline as the result of exploration vs. exploitation of the user's interests on the part of the recommendation algorithm and introduce a set of metrics to capture the extent of personalization across user timelines. We apply our framework to a real TikTok dataset and validate our results using a baseline generated from automated TikTok bots, as well as a randomized baseline. We also investigate the extent to which factors such as video viewing duration, liking, and following drive the personalization of content on TikTok. Our results demonstrate that our framework produces intuitive and explainable results, and can be used to audit and understand personalization in social media feeds.
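    As a rough illustration of the exploration-vs.-exploitation framing, the sketch below computes the fraction of exploit-labeled items over a sliding window of a timeline. The labeling scheme and window size are assumptions for exposition and are not the paper's metric definitions.

    ```python
    # Toy personalization signal: sliding-window fraction of items labeled
    # "exploit" (i.e., matching the user's already-observed interests).
    def exploitation_fraction(labels, window=20):
        fractions = []
        for start in range(max(1, len(labels) - window + 1)):
            chunk = labels[start:start + window]
            fractions.append(sum(1 for x in chunk if x == "exploit") / len(chunk))
        return fractions

    timeline = ["explore"] * 10 + ["exploit", "explore"] * 15 + ["exploit"] * 10
    print(exploitation_fraction(timeline, window=10)[:5])
    ```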
  • Item
    Team formation amidst conflicts
    (ACM, 2024-05-13) Nikolaou, Iasonas; Terzi, Evimaria
    In this work, we formulate the problem of team formation amidst conflicts. The goal is to assign individuals to tasks, with given capacities, taking into account individuals' task preferences and the conflicts between them. Using dependent rounding schemes as our main toolbox, we provide efficient approximation algorithms. Our framework is extremely versatile and can model many different real-world scenarios as they arise in educational settings and human-resource management. We test and deploy our algorithms on real-world datasets and we show that our algorithms find assignments that are better than those found by natural baselines. In the educational setting we also show how our assignments are far better than those done manually by human experts. In the human-resource management application we show how our assignments increase the diversity of teams. Finally, using a synthetic dataset we demonstrate that our algorithms scale very well in practice.
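    To make the problem setup concrete, here is a naive greedy baseline that respects task capacities and penalizes pairwise conflicts when assigning people to tasks. It is purely illustrative of the inputs involved; the paper's algorithms use LP relaxations and dependent rounding, which this sketch does not implement.

    ```python
    # Greedy capacity-constrained assignment with a conflict penalty.
    def greedy_assign(preferences, capacities, conflicts, penalty=1.0):
        """preferences[p][t]: score; capacities[t]: max team size;
        conflicts: set of frozenset({p, q}) pairs that clash."""
        teams = {t: [] for t in capacities}
        assignment = {}
        for person, prefs in preferences.items():
            best_task, best_score = None, float("-inf")
            for task, score in prefs.items():
                if len(teams[task]) >= capacities[task]:
                    continue
                clash = sum(1 for member in teams[task]
                            if frozenset({person, member}) in conflicts)
                if score - penalty * clash > best_score:
                    best_task, best_score = task, score - penalty * clash
            if best_task is not None:
                teams[best_task].append(person)
                assignment[person] = best_task
        return assignment

    prefs = {"a": {"t1": 3, "t2": 1}, "b": {"t1": 2, "t2": 2}, "c": {"t1": 1, "t2": 3}}
    caps = {"t1": 1, "t2": 2}
    conflicts = {frozenset({"a", "b"})}
    print(greedy_assign(prefs, caps, conflicts))
    ```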
  • Item
    Scalability limitations of processing-in-memory using real system evaluations
    (ACM, 2024-02-21) Joshi, Ajay
    Processing-in-memory (PIM), where the compute is moved closer to the memory or the data, has been widely explored to accelerate emerging workloads. Recently, different PIM-based systems have been announced by memory vendors to minimize data movement and improve performance as well as energy efficiency. One critical component of PIM is the large amount of compute parallelism provided across many PIM "nodes" or the compute units near the memory. In this work, we provide an extensive evaluation and analysis of real PIM systems based on UPMEM PIM. We show that while there are benefits of PIM, there are also scalability challenges and limitations as the number of PIM nodes increases. In particular, we show how collective communications that are commonly found in many kernels/workloads can be problematic for PIM systems. To evaluate the impact of collective communication in PIM architectures, we provide an in-depth analysis of two workloads on the UPMEM PIM system that utilize representative common collective communication patterns -- AllReduce and All-to-All communication. Specifically, we evaluate 1) embedding tables that are commonly used in recommendation systems that require AllReduce and 2) the Number Theoretic Transform (NTT) kernel which is a critical component of Fully Homomorphic Encryption (FHE) that requires All-to-All communication. We analyze the performance benefits of these workloads and show how they can be efficiently mapped to the PIM architecture through alternative data partitioning. However, since each PIM compute unit can only access its local memory, when communication is necessary between PIM nodes (or remote data is needed), communication between the compute units must be done through the host CPU, thereby severely hampering application performance. To increase the scalability (or applicability) of PIM to future workloads, we make the case for how future PIM architectures need efficient communication or interconnection networks between the PIM nodes that require both hardware and software support.
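    The host-mediated communication bottleneck the abstract describes can be pictured with a schematic AllReduce: each PIM node reduces its local shard independently, but combining the partial results requires a round trip through the host CPU. The sketch below is a plain-Python illustration of that data flow, not the UPMEM SDK or the paper's implementation.

    ```python
    # Schematic host-mediated AllReduce across PIM nodes.
    def pim_allreduce(shards):
        # Step 1: each PIM node reduces its local data independently.
        partials = [sum(shard) for shard in shards]
        # Step 2: host gathers the partial results and combines them
        # (inter-node traffic must go through the host CPU).
        total = sum(partials)
        # Step 3: host broadcasts the combined result back to every node.
        return [total for _ in shards]

    shards = [[1, 2, 3], [4, 5], [6]]
    print(pim_allreduce(shards))  # every node ends with 21
    ```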
  • Item
    Limits to asset manager adaptation
    (Cambridge University Press, 2024-08-27) Condon, Madison
    In Our Lives in Their Portfolios, Brett Christophers provides an account of the rise of ‘asset manager society’ – a world in which the infrastructures of public life are converted from public to private ownership. Here I use Christophers’ analysis to comment on growing calls for asset manager investment in climate adaptation. The asset manager business model requires ever-escalating returns, a poor fit with the now unavoidable losses that climate change promises to bring.
  • Item
    Who is a health care provider?: statutory interpretation as a middle-ground approach to medical malpractice damage caps
    (Cambridge University Press, 2024-04-02) Margolis, Isaac
    Debates over the effectiveness, constitutionality, and fairness of medical malpractice damage caps are as old as the laws themselves. Though some courts have struck down damage caps under state constitutional provisions, the vast majority hesitate to invalidate malpractice reform legislation. Instead, statutory interpretation offers a non-constitutional method of challenging the broad scope of damage caps without fully invalidating legislative efforts to curtail “excessive” malpractice liability. This Note examines the term “health care providers” in construing malpractice reform laws and identifies two predominant forms of statutory interpretation that state courts apply. In doing so, this Note offers recommendations for courts and legislatures to best balance the goals of the malpractice reform movement with patients’ interests in recovery for medical injuries.
  • Item
    Abortion access for women in custody in the wake of Dobbs v. Jackson women’s health
    (Cambridge University Press, 2024-04-02) Herr, Allison
    The United States Supreme Court’s decision in Dobbs v. Jackson Women’s Health Organization made it drastically harder for women to access abortions. The Dobbs decision has had a disproportionate impact on women who are incarcerated or on some form of community supervision such as probation or parole. This Note analyzes a potential right to an abortion for women involved in the criminal justice system, even those living in states that have banned or deeply restricted abortion access after the Dobbs decision. In doing so, this Note looks for different constitutional avenues to protect incarcerated women’s right to an abortion, including under the Eighth Amendment to the U.S. Constitution.