I Automated My Entire Research Workflow With 10 Free APIs

  • MyrinNew
    Senior Member
    • Feb 2024
    • 5175

    #1


    Two weeks ago, I started a research project that required:
    • Academic papers from multiple databases
    • Patent data
    • Clinical trial information
    • Security checks on all downloaded files


    Manually, this would take days. With 10 free APIs, I automated it in an afternoon.


    Here's the stack I built.


    The Research Pipeline





    Query → OpenAlex (papers) → Crossref (metadata) → Unpaywall (free PDFs)
    → PubMed (medical) → ClinicalTrials.gov (trials) → Patents (USPTO)
    → Semantic Scholar (AI summaries) → Export → Analyze







    Each step is one Python function. Total code: ~200 lines.
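Since every step is a single network call, one transient failure can stop the whole run. A minimal retry sketch follows; `with_retries` and its parameters are hypothetical names introduced here, not part of the original toolkits.

```python
import time

def with_retries(fn, attempts=3, base_delay=1, retry_on=(Exception,), sleep=time.sleep):
    """Call fn() up to `attempts` times, doubling the wait after each failure.

    Hypothetical helper for illustration. `sleep` is injectable so the
    backoff can be observed without actually waiting.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

Any step could then be wrapped at the call site, for example `with_retries(lambda: find_papers(topic))`, ideally with `retry_on` narrowed to the exception types you expect from the HTTP client.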


    Step 1: Find Papers (OpenAlex)





    import requests

    def find_papers(topic, limit=20):
        """Return the most-cited OpenAlex works matching a search term."""
        resp = requests.get('https://api.openalex.org/works', params={
            'search': topic, 'per_page': limit,
            'sort': 'cited_by_count:desc'
        }, timeout=30)
        resp.raise_for_status()
        return [{
            'title': w.get('title') or '',   # guard against a missing title
            'doi': w.get('doi'),             # URL form: https://doi.org/...
            'citations': w['cited_by_count'],
            'year': w.get('publication_year')
        } for w in resp.json()['results']]

    papers = find_papers('CRISPR gene editing therapy')
    if papers:
        print(f"Found {len(papers)} papers, top cited: {papers[0]['citations']}")







    Step 2: Enrich Metadata (Crossref)





    def get_metadata(doi):
        """Look up publisher, journal, and reference count for a DOI on Crossref."""
        if not doi:
            return {}
        doi_id = doi.replace('https://doi.org/', '')
        resp = requests.get(f'https://api.crossref.org/works/{doi_id}', timeout=30)
        if resp.status_code != 200:
            return {}
        item = resp.json()['message']
        return {
            'publisher': item.get('publisher'),
            # container-title is a list and may be empty
            'journal': (item.get('container-title') or [''])[0],
            'references': item.get('references-count', 0)
        }
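Steps 2 and 3 both strip the `https://doi.org/` prefix with a plain `replace`, which misses other common spellings of a DOI. A small normalizer could be shared between them; `normalize_doi` is a hypothetical helper, and the set of prefixes it handles is an assumption about what the upstream APIs might return.

```python
import re

# Prefixes commonly seen in front of a bare DOI (assumed list, not exhaustive).
_DOI_PREFIX = re.compile(r'^(?:https?://(?:dx\.)?doi\.org/|doi:)', re.IGNORECASE)

def normalize_doi(doi):
    """Return a bare, lowercased DOI such as '10.1000/abc', or None if empty."""
    if not doi:
        return None
    bare = _DOI_PREFIX.sub('', doi.strip())
    # DOIs are case-insensitive, so lowercasing makes them safe to compare.
    return bare.lower() or None
```

Lowercasing also makes the DOI usable as a dictionary key when merging results from several sources.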







    Step 3: Find Free PDFs (Unpaywall)





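    def find_pdf(doi):
        """Return a direct PDF URL from Unpaywall, or None if none is listed."""
        if not doi:
            return None
        doi_id = doi.replace('https://doi.org/', '')
        # Unpaywall requires an email parameter; use your own address here
        resp = requests.get(f'https://api.unpaywall.org/v2/{doi_id}',
                            params={'email': 'research@example.com'}, timeout=30)
        if resp.status_code != 200:
            return None
        data = resp.json()
        if data.get('is_oa'):
            # best_oa_location can be missing even for open-access records
            return (data.get('best_oa_location') or {}).get('url_for_pdf')
        return None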
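The requirements list includes security checks on downloaded files, but the post does not show that step. One common first move, assumed here rather than taken from the author's scanner, is to hash each downloaded PDF so the digest can be looked up in whichever scanning service you use. `sha256_of_file` is a hypothetical helper name.

```python
import hashlib

def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with open(path, 'rb') as fh:
        # iter() with a sentinel keeps reading until read() returns b''
        for chunk in iter(lambda: fh.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```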







    Step 4: Get AI Summaries (Semantic Scholar)





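    def get_tldr(title):
        """Return the Semantic Scholar TLDR for the top search hit, if any."""
        resp = requests.get('https://api.semanticscholar.org/graph/v1/paper/search',
                            params={'query': title, 'limit': 1, 'fields': 'tldr'},
                            timeout=30)
        if resp.status_code != 200:   # e.g. rate limited
            return 'No summary available'
        papers = resp.json().get('data', [])
        if papers and papers[0].get('tldr'):
            return papers[0]['tldr']['text']
        return 'No summary available'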
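Because `get_tldr` trusts the first search hit, it can return a summary of a different paper with a similar title. A rough guard is to compare the two titles before accepting the result. This is a sketch: `titles_match` and the 0.9 threshold are assumptions, and it would need the search request to also ask for the `title` field.

```python
from difflib import SequenceMatcher

def titles_match(query_title, result_title, threshold=0.9):
    """Return True when two titles are near-identical, ignoring case and spacing."""
    if not query_title or not result_title:
        return False
    a = ' '.join(query_title.lower().split())
    b = ' '.join(result_title.lower().split())
    return SequenceMatcher(None, a, b).ratio() >= threshold
```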







    Step 5: Check Related Trials (ClinicalTrials.gov)





    def find_trials(topic, limit=5):
        """Return ID, title, and status for ClinicalTrials.gov studies matching a term."""
        resp = requests.get('https://clinicaltrials.gov/api/v2/studies', params={
            'query.term': topic, 'pageSize': limit, 'format': 'json'
        }, timeout=30)
        resp.raise_for_status()
        return [{
            'nct_id': s['protocolSection']['identificationModule']['nctId'],
            'title': s['protocolSection']['identificationModule']['briefTitle'],
            'status': s['protocolSection']['statusModule']['overallStatus']
        } for s in resp.json().get('studies', [])]
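`find_trials` returns studies in any status, while the results further down refer to active trials. If only ongoing studies matter, a client-side filter along these lines would do it. Which statuses count as "active" is an assumption made here, and `only_active` is a hypothetical helper.

```python
# Status values treated as "active" in this sketch (an assumption, adjust as needed).
ACTIVE_STATUSES = frozenset({
    'RECRUITING',
    'NOT_YET_RECRUITING',
    'ENROLLING_BY_INVITATION',
    'ACTIVE_NOT_RECRUITING',
})

def only_active(trials, statuses=ACTIVE_STATUSES):
    """Keep trial dicts (as produced by find_trials) whose status is in `statuses`."""
    return [t for t in trials if t.get('status') in statuses]
```

To still end up with five active trials, request more than five from `find_trials` and slice after filtering.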







    Step 6: Check Patents (USPTO)





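    def find_patents(topic, limit=5):
        """Return recent patents whose abstract mentions any word in the topic."""
        resp = requests.post('https://api.patentsview.org/patents/query', json={
            'q': {'_text_any': {'patent_abstract': topic}},
            'f': ['patent_number', 'patent_title', 'patent_date'],
            'o': {'per_page': limit},
            's': [{'patent_date': 'desc'}]
        }, timeout=30)
        if resp.status_code != 200:   # endpoint unavailable or query rejected
            return []
        # 'patents' may be null when nothing matches
        return resp.json().get('patents') or []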







    The Full Pipeline





    def research(topic):
        print(f"Researching: {topic}\n")

        # Papers
        papers = find_papers(topic, limit=10)
        print(f"📚 {len(papers)} papers found")

        # Enrich top 5 with metadata + PDFs
        for p in papers[:5]:
            meta = get_metadata(p['doi'])
            pdf = find_pdf(p['doi'])
            tldr = get_tldr(p['title'])
            print(f" • {p['title'][:60]}")
            print(f" Citations: {p['citations']} | Journal: {meta.get('journal', 'N/A')}")
            print(f" PDF: {'✅' if pdf else '❌'} | TLDR: {tldr[:80]}...")

        # Clinical trials
        trials = find_trials(topic)
        print(f"\n🏥 {len(trials)} clinical trials")
        for t in trials:
            print(f" [{t['status']}] {t['title'][:60]}")

        # Patents
        patents = find_patents(topic)
        print(f"\n📜 {len(patents)} patents")
        for p in patents:
            print(f" [{p['patent_date']}] {p['patent_title'][:60]}")

    research('CRISPR gene editing therapy')
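The pipeline diagram ends with an Export step, but `research` only prints. One way to export, sketched here with the standard library, is to serialize the paper dicts to CSV. `papers_to_csv` is a hypothetical helper, and the column names simply mirror the keys returned by `find_papers`.

```python
import csv
import io

def papers_to_csv(papers, fields=('title', 'doi', 'citations', 'year')):
    """Serialize a list of paper dicts to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    for p in papers:
        # Keep only the chosen columns; missing or None values become empty cells.
        writer.writerow({f: p.get(f, '') for f in fields})
    return buf.getvalue()
```

The returned string can be written to disk with `open('papers.csv', 'w', newline='', encoding='utf-8')`.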







    Results

    For one query, I got:
    • 10 highly-cited papers with metadata
    • 4 free PDFs (via Unpaywall)
    • AI summaries for all papers
    • 5 active clinical trials
    • 5 related patents


    All in under 30 seconds.
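Most of that time is spent waiting on sequential HTTP calls. If the per-paper lookups need to go faster, they can run in a small thread pool. This is a sketch under assumptions: `enrich_all` is a hypothetical helper, and the worker count should stay low enough to respect each API's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor

def enrich_all(papers, enrichers, max_workers=4):
    """Apply every enricher to every paper concurrently, preserving input order.

    `enrichers` maps an output key to a function taking one paper dict.
    """
    def enrich(paper):
        out = dict(paper)  # copy so the caller's dicts are not mutated
        for key, fn in enrichers.items():
            out[key] = fn(paper)
        return out

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `papers`
        return list(pool.map(enrich, papers))
```

With the functions from the steps above, a call might look like `enrich_all(papers[:5], {'meta': lambda p: get_metadata(p['doi']), 'pdf': lambda p: find_pdf(p['doi'])})`.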


    All Toolkits (Open Source)

    I packaged each step into its own toolkit:


     1  OpenAlex            250M+ academic works
     2  Crossref            150M+ article metadata
     3  PubMed              36M+ medical papers
     4  Semantic Scholar    AI summaries
     5  arXiv               2.4M+ preprints
     6  CORE                300M+ open access
     7  Unpaywall           Find free PDFs
     8  ClinicalTrials.gov  500K+ trials
     9  USPTO Patents       8M+ patents
    10  Security Scanner    5 security APIs


    Full collection: awesome-free-research-apis





    What would you automate if you had all these APIs in one pipeline? I'm curious about creative use cases.





    Need custom data pipelines? My tools | GitHub



