The Basel BioData community!
2025-11-13
Welcome to the Basel BioData Community meetup on LLMs in Research!
My name is Flavio Lombardo, I am a computational biologist at the University Hospital in Basel. I am happy to share some ideas on how LLMs are changing research (and other fields) and how they could help you.
Link to the meeting: click here for the Google Meet.
Install these R packages to follow along:
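The package list did not survive this export; the set below is an assumption inferred from the code used later in these slides:

```r
# Packages used in the examples that follow
# (list inferred from the later slides, not the original install chunk)
install.packages(c(
  "ellmer",    # Posit's LLM interface
  "jsonlite",  # parse JSON responses
  "pdftools",  # extract text from PDFs
  "purrr",     # iterate over multiple files
  "dplyr",     # wrangle extracted results
  "tidyr"      # unnest list-columns
))
```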
Step 1: Visit platform.deepseek.com
Step 2: Sign up and get your API key (free $5 credit = ~35M tokens!)
Step 3: Add to your .Renviron file:
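The file contents were stripped from this page; assuming ellmer's default `DEEPSEEK_API_KEY` environment variable (the name `chat_deepseek()` looks up), the entry would be:

```
DEEPSEEK_API_KEY=sk-your-key-here
```

Restart R (or run `readRenviron("~/.Renviron")`) so the new variable is picked up.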
Don’t have an API key yet? No problem! Follow along and try later.
For privacy-sensitive work or offline usage:
Then install the Ollama app:
- Visit: ollama.com
- Download for your operating system
- Pull a model (we'll use qwen3:1.7b)
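The pull step above can be run from a terminal once the app is installed (this assumes the `ollama` CLI is on your PATH):

```bash
# Download the model used in this talk
ollama pull qwen3:1.7b

# Check which models are available locally
ollama list
```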
Large Language Models are AI systems trained on vast amounts of text data.
Why should researchers (or you) care?
They can accelerate literature review, data analysis, code development, and scientific writing.
Note: For sensitive data, consider local models
ellmer is Posit’s official R package for working with LLMs (launched Feb 2025!)
Step 1: Get a DeepSeek API key from platform.deepseek.com
Step 2: Set up your API key
Step 3: Create a chat session with DeepSeek
Why DeepSeek?
- Much cheaper: $0.14 per 1M tokens vs OpenAI's $2.50
- Strong coding & research capabilities
- Your $5 credit = ~35M tokens!
Note: DeepSeek requires a workaround for structured data (see next slide)
DeepSeek requires forcing JSON output via api_args:
library(ellmer)
library(jsonlite)
# Create chat with JSON output forced
chat <- chat_deepseek(
system_prompt = "You are a research assistant.
Always respond with valid JSON only.",
api_args = list(response_format = list(type = "json_object"))
)
# Ask for structured data
abstract <- "In our 2023 study, we analyzed 150 patients.
The treatment showed significant improvement (p = 0.001)..."
prompt <- 'Extract this data as JSON:
{
"year": <integer>,
"sample_size": <integer>,
"p_value": <number>,
"conclusion": "<string>"
}'
response <- chat$chat(paste(abstract, prompt))
# Parse JSON response
data <- fromJSON(response)
data$year # 2023
data$sample_size # 150
data$p_value # 0.001

Go beyond structured extraction and ask direct questions.
library(pdftools)
library(ellmer)
# Extract text from a specific paper
pdf_path <- "Multimodal learning enables chat-based exploration of single-cell data.pdf"
text <- pdf_text(pdf_path) |> paste(collapse = "\n")
# Create a chat session
chat <- chat_deepseek(
system_prompt = "You are a helpful research assistant.
Answer questions based *only* on the provided text.",
api_args = list(),
seed = 1234
)
# Ask a specific question about the paper's content
question <- "What was the primary conclusion regarding this paper?"
# Combine context and question
prompt <- paste(
"Here is the text from a research paper:",
text,
"---",
"Please answer the following question based on the text:",
question,
sep = "\n"
)
# Get the answer
response <- chat$chat(prompt)

Extract structured data from multiple papers efficiently:
library(ellmer)
library(jsonlite)
library(pdftools)
library(purrr)
# Create chat with JSON mode
chat <- chat_deepseek(
system_prompt = "Extract paper metadata as valid JSON.
Use null for missing values.",
api_args = list(response_format = list(type = "json_object"))
)
# Define extraction template
extraction_prompt <- 'Extract these fields as JSON:
{
"title": "<string>",
"year": <integer>,
"authors": ["<string>"],
"sample_size": <integer or null>,
"methods": ["<string>"],
"key_finding": "<string>",
"p_values": [<number>]
}'
# Function to extract from one PDF
extract_paper_data <- function(pdf_path) {
text <- pdf_text(pdf_path) |> paste(collapse = "\n")
prompt <- paste(text, extraction_prompt, sep = "\n\n")
response <- chat$chat(prompt)
fromJSON(response)
}
# Process multiple PDFs
pdf_files <- list.files(pattern = "\\.pdf$", full.names = TRUE)
papers_data <- map(pdf_files, extract_paper_data)
# Convert to data frame
library(dplyr)
library(tidyr)
papers_df <- papers_data |>
# wrap multi-value fields (authors, methods, p_values) as list-columns
map(\(p) map(p, \(x) if (length(x) == 1) x else list(x))) |>
map_dfr(as_tibble) |>
unnest_wider(where(is.list), names_sep = "_")
# Analyze results
papers_df |>
filter(year >= 2020) |>
arrange(desc(sample_size))

For sensitive data or offline work, run models locally with Ollama:
library(ellmer)
library(jsonlite)
# Define JSON schema for extraction
schema <- list(
type = "object",
properties = list(
year = list(type = "integer"),
sample_size = list(type = "integer")
),
required = c("year", "sample_size")
)
# Create chat with JSON mode and clear instructions
chat <- chat_ollama(
model = "qwen3:1.7b",
system_prompt = "You must respond with valid JSON only.
No explanations, no markdown, just JSON.",
api_args = list(format = schema)
)
# Extract data with explicit JSON request
text <- "In 2023, we studied 150 patients..."
prompt <- paste0(
"Extract as JSON: ", text,
"\nReturn format: {\"year\": <int>, \"sample_size\": <int>}"
)
response <- chat$chat(prompt)
data <- fromJSON(response)
# Access results
data$year # 2023
data$sample_size # 150

Recommended local models:
- qwen3:1.7b - fast, good quality (recommended)
- llama3.2:3b - balanced performance
- phi4:latest - Microsoft's efficient model

Tune randomness via api_args = list(temperature = 0.7).

DeepSeek (Recommended - Cheapest):
- Visit: platform.deepseek.com
- Free $5 credit (~35M tokens!)
- $0.14 per 1M tokens
Alternative Providers:
- OpenAI: platform.openai.com - GPT-5, GPT-4o
Documentation:
- ellmer Documentation - main package docs
- Structured Data Guide - type-safe extraction
- DeepSeek API Docs - API reference
- Prompt Engineering Guide - writing better prompts
Get Help:
- Posit Community - ask questions
- R for Data Science - learn R fundamentals
- ellmer GitHub Issues - report bugs
Stay Updated:
- Follow the Posit blog for ellmer updates
- Join R communities on social media
Let’s discuss:
See you at the next meetup!
Questions? Let’s chat!
Note: This approach uses ellmer’s native chat_structured() which currently works with OpenAI and Claude, but NOT with DeepSeek.
Note: DeepSeek can be forced to output structured data by passing api_args = list(response_format = list(type = "json_object")) and then parsing the returned JSON yourself.
Extract typed data (dates, numbers, strings) from text:
library(ellmer)
# Define extraction schema
paper_data <- type_object(
title = type_string(),
publication_date = type_string(),
sample_size = type_integer(),
p_value = type_number(),
conclusion = type_string()
)
# Extract from paper abstract
abstract <- "In our study (2023), we analyzed 150 patients.
The treatment showed significant improvement (p = 0.001)..."
# Use OpenAI or Claude (NOT DeepSeek)
chat <- chat_openai() # or chat_anthropic()
result <- chat$chat_structured(
abstract,
type = paper_data
)
# Access structured data
result$publication_date # "2023"
result$sample_size # 150
result$p_value # 0.001

Note: This shows ellmer's chat_structured(), which works with OpenAI/Claude but NOT DeepSeek.
library(pdftools)
library(ellmer)
# Extract PDF text
text <- pdf_text("Multimodal learning enables chat-based exploration of single-cell data.pdf") |> paste(collapse = "\n")
# Define what to extract
schema <- type_object(
title = type_string(),
year = type_integer(),
sample_size = type_integer(),
main_findings = type_string()
)
# Extract structured data with OpenAI or Claude
chat <- chat_openai() # or chat_anthropic()
data <- chat$chat_structured(
text,
type = schema
)
# Access results
data$year # 2023
data$sample_size # 150

Tip: Use parallel_chat_structured() for batch processing multiple PDFs!