publications
publications by categories in reversed chronological order. generated by jekyll-scholar.
2024
- Starcoder 2 and the stack v2: The next generation. arXiv preprint arXiv:2402.19173, 2024
- Resources for Combining Teaching and Research in Information Retrieval Coursework. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024
- Teaching Information Retrieval with a Shared Task Across Universities: First Steps and Findings. 2024
2023
- Towards openness beyond open access: User journeys through 3 Open AI Collaboratives. arXiv preprint arXiv:2301.08488, 2023
- Santacoder: don't reach for the stars! arXiv preprint arXiv:2301.03988, 2023
- The ROOTS search tool: Data transparency for LLMs. arXiv preprint arXiv:2302.14035, 2023
- Spacerini: Plug-and-play search engines with Pyserini and Hugging Face. arXiv preprint arXiv:2302.14534, 2023
- Exploring hyperparameter usage and tuning in machine learning research. Papers on tuning, 2023
- Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023
- GAIA search: Hugging face and pyserini interoperability for nlp training data exploration. arXiv preprint arXiv:2306.01481, 2023
- Stable bias: Evaluating societal representations in diffusion models. Advances in Neural Information Processing Systems, 2023
2022
- Tracking discourse influence in darknet forums. arXiv preprint arXiv:2202.02081, 2022
- Entities, dates, and languages: Zero-shot on historical texts with t0. arXiv preprint arXiv:2204.05211, 2022
- How Train–Test Leakage Affects Zero-Shot Retrieval. In International Symposium on String Processing and Information Retrieval, 2022
- Noise-reduction for automatically transferred relevance judgments. In International Conference of the Cross-Language Evaluation Forum for European Languages, 2022
- The bigscience roots corpus: A 1.6 tb composite multilingual dataset. Advances in Neural Information Processing Systems, 2022
- Bigscience: A case study in the social construction of a multilingual large language model. arXiv preprint arXiv:2212.04960, 2022
- Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022
- How Train-Test Leakage Affects. In String Processing and Information Retrieval: 29th International Symposium, SPIRE 2022, Concepción, Chile, November 8–10, 2022, Proceedings, 2022
2021
- Learning to Rank Arguments with Feature Selection. In CLEF (Working Notes), 2021
- Muse: The musical sentiment dataset. Journal of Open Humanities Data, 2021
- BERTian Poetics: Constrained Composition with Masked LMs. arXiv preprint arXiv:2110.15181, 2021
2020
- Exploring Argument Retrieval with Transformers. In CLEF (Working Notes), 2020
- Toward a Musical Sentiment (MuSe) Dataset for Affective Distant Hearing. In Workshop on Computational Humanities Research (CHR 2020), 2020