CDC Text Corpora for Learners: HTML Mirrors of MMWR, EID, and PCD
The attached ZIP archives are part of the <a href="https://github.com/cmheilig/harvest-cdc-journals">CDC Text Corpora for Learners</a> program. This version, comprising 33,567 articles, was constructed on 2024-03-01 using source content retrieved on 2024-01-09.
The three attached ZIP archives contain the 33,567 articles in 33,576 compiled HTML mirrors of the <a href="https://www.cdc.gov/mmwr/">Morbidity and Mortality Weekly Report</a> (MMWR), including its series <i>Weekly Reports</i>, <i>Recommendations and Reports</i>, <i>Surveillance Summaries</i>, <i>Supplements</i>, and <i>Notifiable Diseases</i> (a subset of <i>Weekly Reports</i>, constructed ad hoc); <a href="https://www.cdc.gov/eid/">Emerging Infectious Diseases</a> (EID); and <a href="https://www.cdc.gov/pcd/">Preventing Chronic Disease</a> (PCD). There is one archive per series. The archive attachments are located in the <i>About this Dataset</i> section of this landing page: click Show More in that section, and the archives appear under <i>Attachments</i>.
The files were retrieved and organized with as few changes to the raw sources as possible, to support as many downstream uses as possible.
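Because the archives hold largely unmodified HTML, one way to start working with them is to stream the HTML members out of a ZIP file as raw bytes and defer any decoding or parsing to later steps. The following is a minimal sketch using only the Python standard library. The function name iter_html_members and the assumption that mirrored pages end in .html or .htm are illustrative and are not taken from the dataset documentation.

```python
import zipfile
from typing import IO, Iterator, Tuple, Union


def iter_html_members(archive: Union[str, IO[bytes]]) -> Iterator[Tuple[str, bytes]]:
    """Yield (member_name, raw_bytes) for each HTML file in a ZIP archive.

    Assumption: mirrored pages use an .html or .htm extension. Bytes are
    returned undecoded so that no character encoding is guessed here.
    """
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # skip directory entries
            if info.filename.lower().endswith((".html", ".htm")):
                yield info.filename, zf.read(info)
```

Passing the path of a downloaded archive (the actual file names are listed under <i>Attachments</i>) yields one pair per HTML file, which can then be handed to whatever HTML parser suits the downstream task.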
Querying over HTTP
Splitgraph serves as an HTTP API that lets you run SQL queries directly on
this data to power Web applications. For example:
curl https://data.splitgraph.com/sql/query/ddn \
-H "Content-Type: application/json" \
-d@-<<EOF
{"sql": "
SELECT *
FROM \"cdc-gov/cdc-text-corpora-for-learners-html-mirrors-of-mmwr-ut5n-bmc3\".\"cdc_text_corpora_for_learners_html_mirrors_of_mmwr\"
LIMIT 100
"}
EOF
See the Splitgraph documentation
for more information.
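The same request can be issued from Python. The sketch below mirrors the curl example using only the standard library: it posts a JSON body with a single "sql" key to the same endpoint. The helper names build_request and run_query are hypothetical, and the shape of the JSON response is not described on this page, so the result is returned as decoded JSON without further interpretation.

```python
import json
import urllib.request

# Endpoint and table name copied from the curl example above.
ENDPOINT = "https://data.splitgraph.com/sql/query/ddn"
TABLE = (
    '"cdc-gov/cdc-text-corpora-for-learners-html-mirrors-of-mmwr-ut5n-bmc3"'
    '."cdc_text_corpora_for_learners_html_mirrors_of_mmwr"'
)


def build_request(sql: str) -> urllib.request.Request:
    """Build a POST request carrying the SQL text as a JSON payload."""
    body = json.dumps({"sql": sql}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql: str, timeout: float = 30.0):
    """Send the query and return the decoded JSON response.

    The response structure is an unknown here; inspect it before relying
    on any particular key.
    """
    with urllib.request.urlopen(build_request(sql), timeout=timeout) as resp:
        return json.load(resp)
```

For example, run_query(f"SELECT * FROM {TABLE} LIMIT 100") sends the same query as the curl command. Using json.dumps to build the body also takes care of escaping the double quotes around the table name.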