Fresh Hacker News | Show HN: Large Scale Article Extract of Newspapers 1730s-1960s

▲Show HN: Large Scale Article Extract of Newspapers 1730s-1960s (snewpapers.com)

14 points by brettnbutter 3 hours ago | 3 comments

A few examples you can click on without having to authenticate or start the free trial (no credit card is needed if you do, though, and I won't bother you or chase you with spam):

https://snewpapers.com/components/b2d40c08-db63-40e8-890f-09...

https://snewpapers.com/components/0fabc8e4-a60b-4f31-9ad1-b0...

https://snewpapers.com/components/cdde790f-4e97-4f2d-a2c2-95...

▲StilesCrisis 25 minutes ago

I see an obvious typo in the first one: "wickked deeds of witchecran" (should be craft)

I can see why the OCR is a challenge here, and spellcheck is a lost cause, but I'm surprised an LLM cleanup pass didn't detect this?

▲brettnbutter 17 minutes ago

In hindsight it was probably a terrible example to use, because people will think the OCR is off. But if you click on the clipping (or download the PDF from the PDF link at the top) and zoom in, you'll see that it's quoting some ancient text verbatim, which uses a lot of old-timey spelling (e.g. "wickked" really is spelled "wickked" in the article), so I'm pretty happy with the quality it managed to eke out on that!

Check out the other examples for a more representative quality :-)

▲zzleeper 1 hour ago

Looks cool, congrats!

I've also worked with this data, but only for research purposes:

https://www.finhist.com/bank-runs/episodes/13895.html

https://www.finhist.com/bank-runs/index.html

Surprisingly, I found out that layout was the trickiest thing, as newspaper articles often had multiple layers of headers, spanned multiple columns, etc.

Do you have a preferred solution on that?

▲brettnbutter 57 minutes ago

Nice collection you have there.

Just asked the Sleuth for some examples of that, and here's one to add to your Union National one: https://www.finhist.com/bank-runs/episodes/19827.html

https://snewpapers.com/components/0b22f0ca-60d2-4d63-be99-74...

Yes, I agree the layouts are the trickiest part. I tried a few approaches and ended up using some of the PaddlePaddle models for document layout analysis, orientation, and such. They give bounding boxes and a predicted reading order, but the reading orders aren't great on complex layouts even with the most recent SOTA models, or even on simple layouts when there are mastheads, images, or other artifacts to work around. It's still valuable information, though, which can be combined with heuristics to stitch together a more accurate reading order as the starting point of a pipeline.
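For anyone curious what "bounding boxes plus heuristics" can look like, here is a minimal sketch. It is not the actual pipeline and does not call any PaddlePaddle API: the `Block` type, the `reading_order` function, and the `span_ratio` and `column_gap` thresholds are all hypothetical names and values chosen for illustration. The assumed heuristic is that very wide blocks (mastheads, banner headlines) split the page into horizontal bands, and within each band the narrow blocks are read column by column, top to bottom.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """One detected layout region; coordinates are in pixels (assumed)."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float


def _columns_then_rows(blocks, column_gap):
    """Group blocks into columns by left edge, then read each column top-down."""
    columns = []  # each column is a list of blocks
    for b in sorted(blocks, key=lambda b: b.x0):
        # Same column if the left edge is close to the column's first block.
        if columns and b.x0 - columns[-1][0].x0 <= column_gap:
            columns[-1].append(b)
        else:
            columns.append([b])
    return [b for col in columns for b in sorted(col, key=lambda b: b.y0)]


def reading_order(blocks, page_width, span_ratio=0.6, column_gap=20.0):
    """Return blocks in an estimated reading order.

    Blocks at least span_ratio * page_width wide are treated as band
    separators; both thresholds are illustrative, not tuned values.
    """
    spanning = sorted(
        (b for b in blocks if (b.x1 - b.x0) >= span_ratio * page_width),
        key=lambda b: b.y0,
    )
    remaining = [b for b in blocks if b not in spanning]

    ordered = []
    for sp in spanning:
        # Everything that starts above this wide block is read before it.
        above = [b for b in remaining if b.y0 < sp.y0]
        remaining = [b for b in remaining if b.y0 >= sp.y0]
        ordered.extend(_columns_then_rows(above, column_gap))
        ordered.append(sp)
    ordered.extend(_columns_then_rows(remaining, column_gap))
    return ordered
```

A real page would need more than this (images, articles that jump columns, skewed scans), but it shows how a model's boxes can seed a rule-based ordering.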

▲zzleeper 24 minutes ago

Great! I was thinking about PP, but because I only ran an order of magnitude fewer articles (under 1M pages, piggybacking on Dell's OCR), I relied on Arcanum (https://www.arcanum.com/en/newspaper-segmentation/about/), which was cheap enough for me (though I think not cheap enough at your scale).

Cheers!

▲benwills 2 hours ago

As someone who has done a lot of downloading/parsing, this is so awesome and impressive to see.

One thing to think about, which I also struggle with when it comes to large and complicated datasets, is the UI. Even being in the search industry for a long time, it's difficult for me to concretely see how I would use this.

I'd suggest taking a small sample of the dataset that reflects how people would use it, then making that segment public and immediately searchable without registering, e.g. one year of articles related to the Olympics.

What I've found is that it's hard for a lot of people to imagine how they would use something without actually using it. So giving people the actual experience of searching the archive and interacting with the results would go a long way.

Again, congrats. This is really impressive work.

▲brettnbutter 1 hour ago

Thank you, I really appreciate it. I will see if I can figure out how to do that, or like "if you're authed, you can try the Sleuth or get x free searches a month"? The balance is trying to do that without (potentially) overwhelming the databases, more so than trying intentionally to gate people out from anything. I'll figure it out!
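As a rough illustration of the "x free searches a month" idea, a per-user monthly counter could be sketched like this. Everything here is hypothetical: the `MonthlySearchQuota` class, its method names, and the in-memory storage are assumptions for the sake of the example, and a real deployment would presumably keep the counts in a shared store.

```python
from collections import defaultdict
from datetime import date


class MonthlySearchQuota:
    """In-memory per-user search counter that rolls over each calendar month."""

    def __init__(self, free_searches_per_month):
        self.limit = free_searches_per_month
        # Keyed by (user_id, year, month), so a new month starts at zero.
        self._used = defaultdict(int)

    def try_consume(self, user_id, today=None):
        """Record one search and return True, or return False if the quota is spent."""
        today = today or date.today()
        key = (user_id, today.year, today.month)
        if self._used[key] >= self.limit:
            return False
        self._used[key] += 1
        return True
```

Checking `try_consume` before running a query means a request that is over quota never reaches the database at all.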

I don't know if you looked at the "Label Specific" search, but I think I could fairly easily isolate that to a particular label and sub-type for people to search within without much risk to the backend. Any thoughts on a good category?