Astro - Hacker News

36 comments

teddy-smith 15 minutes ago

It's extremely easy to convert HTML/CSS to a PDF with the print to PDF feature of the browser.
All papers should be written in HTML/CSS or TeX and then simply converted to PDF.
Why are we even talking about this?
- tefkah 9 minutes ago
  
  What are you talking about? No one’s writing their paper in HTML.
  The problem is having the submissions be in TeX and converting that to HTML, when the only output has been PDF for so long.
  The problem isn’t converting HTML to PDF, it’s making available a giant portion of TeX/pdf only papers in HTML.
  If you’re arguing that maybe TeX shouldn’t be the source format for papers, then I agree, but other than Typst (which also isn’t perfect about HTML output yet) there aren’t many widely accepted authoring formats for physics/math papers, which is what arXiv primarily hosts.
ForceBru 2 hours ago

Is this new or somehow updated? HTML versions of papers have been available for several years now.
EDIT: indeed, it was introduced in 2023: https://blog.arxiv.org/2023/12/21/accessibility-update-arxiv...
- Tagbert 2 hours ago
  
  From the paper...
  Why "experimental" HTML?
  Did you know that 90% of submissions to arXiv are in TeX format, mostly LaTeX? That poses a unique accessibility challenge: to accurately convert from TeX—a very extensible language used in myriad unique ways by authors—to HTML, a language that is much more accessible to screen readers and text-to-speech software, screen magnifiers, and mobile devices. In addition to the technical challenges, the conversion must be both rapid and automated in order to maintain arXiv’s core service of free and fast dissemination.
  - ForceBru an hour ago
    
    No, I mean _arXiv_ has had experimental support for generating HTML versions of papers for years now. If you visit arXiv, you'll see a lot of papers have generated HTML alongside the usual PDF, so I'm trying to understand whether the article discusses any new developments. It seems like it's not new at all.
billconan an hour ago

I don't think HTML is the right approach. HTML is better than PDF, but it is still a format for displaying/rendering.
the actual paper content format should be separated from its rendering.
i.e. it should contain abstract, sections, equations, figures, citations etc. but it shouldn't have font sizes, layout etc.
the viewer platforms then should be able to style the content differently.
- bob1029 19 minutes ago
  
  > HTML is better than PDF
  I disagree. PDF is the most desirable format for printed media and its analogues. Any time I plan to seriously entertain a paper from arXiv, I print it out first. I prefer to have the author's original intent in hand. Arbitrary page breaks and layout shifts that result from my specific hardware/software configuration are not desirable to me in this context of use.
  - ACCount37 9 minutes ago
    
    I agree that PDF is best for things that are meant to be printed, no questions. But I wonder how common actually printing those papers is?
    In research and in embedded hardware both, I've met some people who had entire stacks of papers printed out - research papers or datasheets or application notes - but also people who had 3 monitors and 64GB of RAM and all the papers open as browser tabs.
    I'm far closer to the latter myself. Is this a "generational split" thing?
- dimal 34 minutes ago
  
  Perfect is the enemy of good. HTML is good enough. Let’s get this done.
  And as another commenter has pointed out, HTML does exactly what you ask for. If it’s done correctly, it doesn’t contain font sizes or layout. Users can style HTML differently with custom CSS.
  - billconan 27 minutes ago
    
    mixing rendering definitions with content (PDF) is something from the printer era, that is unsuitable for the digital era.
    HTML was a digital format, but it wanted to be a generic format for all document types, not just papers, so it contains a lot of extras that a paper format doesn't need.
    for research papers, since they share the same structure, we can further separate content from rendering.
    for example, if you want to later connect a paper with an AI, do you want to send <div class="abstract"> ... ?
    or use some nasty heuristic to extract the abstract, like document.getElementsByClassName("abstract")[0]?
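For illustration, here is roughly what that class-name heuristic looks like outside a browser. This is only a sketch using Python's standard `html.parser`; the `abstract` class name comes from the comment above, while `AbstractExtractor`, `extract_abstract`, and `SAMPLE` are hypothetical names invented for the example, not anything arXiv ships.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class AbstractExtractor(HTMLParser):
    """Collect the text of the first element whose class list contains 'abstract'."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside the matched element (0 = outside)
        self.done = False   # True once the first matching element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        else:
            classes = (dict(attrs).get("class") or "").split()
            if "abstract" in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_abstract(html):
    """Return the whitespace-normalised abstract text, or '' if none is found."""
    parser = AbstractExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


# Hypothetical sample markup, made up for this example.
SAMPLE = """
<article>
  <h1>A title</h1>
  <div class="abstract"><p>We study <em>things</em>.</p></div>
  <div class="body">Not the abstract.</div>
</article>
"""

print(extract_abstract(SAMPLE))
```

The fragility is the point: the extraction works only as long as every paper happens to use the same class name.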
    
    
    simonw 16 minutes ago
    
    All of the interesting LLMs can handle a full paper these days without any trouble at all. I don't think it's worth spending much time optimizing for that use-case any more - that was much more important two years ago when most models topped out at 4,000 or 8,000 tokens.
- afavour an hour ago
  
  Wouldn’t that be CSS?
  - billconan 39 minutes ago
    
    no
    <div class="abstract-container">
    <div class="abstract">
    <pre><code> abstract text ... </code></pre>
    </div>
    <div class="author-list">
    <ol>
    <li>author one</li>
    <li>author two</li>
    </ol>
    </div>
    </div>
    should be just:
    [abstract]
    abstract text
    [authors]
    author one | email | affiliation
    author two | email | affiliation
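A minimal sketch of how a viewer might consume such a content-only format, assuming exactly the layout shown above: `[section]` header lines, free text for the abstract, and `name | email | affiliation` rows for authors. The function names `parse_paper` and `render_html` are hypothetical, and the HTML produced is just one possible rendering that a platform could swap out.

```python
from html import escape


def parse_paper(text):
    """Parse '[section]' headers followed by content lines into a plain dict."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections[current] = []
        elif current is not None:
            sections[current].append(line)

    paper = {"abstract": " ".join(sections.get("abstract", [])), "authors": []}
    for line in sections.get("authors", []):
        fields = [field.strip() for field in line.split("|")]
        fields += [""] * (3 - len(fields))  # tolerate missing trailing fields
        name, email, affiliation = fields[:3]
        paper["authors"].append(
            {"name": name, "email": email, "affiliation": affiliation}
        )
    return paper


def render_html(paper):
    """One possible rendering; the content dict carries no styling of its own."""
    items = "".join(f"<li>{escape(a['name'])}</li>" for a in paper["authors"])
    abstract = escape(paper["abstract"])
    return f'<section class="abstract">{abstract}</section><ol class="authors">{items}</ol>'


# The sample is the text from the comment above, copied as-is.
SAMPLE = """
[abstract]
abstract text
[authors]
author one | email | affiliation
author two | email | affiliation
"""

paper = parse_paper(SAMPLE)
print(paper["abstract"])
print(render_html(paper))
```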
    
    
    afavour 28 minutes ago
    
    Sounds like XML and XSL would be a great fit here. Shame it’s being deprecated.
    But you could still use HTML. Element names containing a dash are reserved for custom elements (that is, a new standardised element will never take such a name), so you could do:
    <paper-author-list> <paper-author></paper-author> </paper-author-list>
    And it would be valid HTML. Then you’d style it with CSS, with
    paper-author { display: list-item; }
    And so on.
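As a rough illustration of the naming rule, here is a simplified check for whether a tag name can be used as a custom element. It is a sketch, not the full rule from the HTML standard: it handles only ASCII names (the standard also allows many non-ASCII characters), and `is_custom_element_name` is a hypothetical helper invented for this example.

```python
import re

# Hyphenated names that already exist in SVG/MathML and are therefore excluded.
RESERVED_NAMES = {
    "annotation-xml", "color-profile", "font-face", "font-face-src",
    "font-face-uri", "font-face-format", "font-face-name", "missing-glyph",
}

# ASCII-only simplification: a lowercase letter first, at least one hyphen,
# and no uppercase letters anywhere.
NAME_PATTERN = re.compile(r"[a-z][-.0-9_a-z]*-[-.0-9_a-z]*")


def is_custom_element_name(name):
    """Return True if `name` passes the simplified custom-element naming check."""
    if name in RESERVED_NAMES:
        return False
    return NAME_PATTERN.fullmatch(name) is not None


for candidate in ("paper-author", "paper-author-list", "article", "font-face"):
    print(candidate, is_custom_element_name(candidate))
```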
    
    
    bawolff 17 minutes ago
    
    Nothing is stopping you from using server-side XSL. I personally don't think it's a great fit, but people need to stop acting like XSL has been wiped from the face of the earth.
    
    
    afavour 8 minutes ago
    
    Yes but we’re specifically talking about a display format here. Something requiring a server side transform before being viewable by a user is a clear step backwards.
    
    panzi 27 minutes ago
    
    There is <article> <section> <figure> <legend>, but yes, <abstract> and <authors> are missing as such. But there are meta tags for such things. Then there is RDF and Thing. Not quite the same, I know, but it's not completely useless.
    
    
    kevindamm 8 minutes ago
    
    and you could shim these gaps with custom components, hypothetically
cubefox 3 minutes ago

This is not new; the title should say (2023). They have shipped the HTML feature with an "experimental" flag for two years now, but I don't know whether there is even any plan to move out of the experimental phase.
It's not much of an "experiment" if you don't plan to use some experimental data to improve things somehow.
Barbing an hour ago

>Did you know that 90% of submissions to arXiv are in TeX format, mostly LaTeX? That poses a unique accessibility challenge: to accurately convert from TeX—a very extensible language used in myriad unique ways by authors—to HTML, a language that is much more accessible to screen readers and text-to-speech software, screen magnifiers, and mobile devices.
Challenging. Good work!
el3ctron 2 hours ago

Accessibility barriers in research are not new, but they are urgent. The message we have heard from our community is that arXiv can have the most impact in the shortest time by offering HTML papers alongside the existing PDF.
- lalithaar 2 hours ago
  
  Hello, I was going through the HTML versions of my preprints on arXiv. Thank you for all that you guys do. Please let me know if the community can contribute to this in any way.
vatsachak 17 minutes ago

Why do we like HTML more than pdfs?
HTML rendering requires you to be connected to the internet, or to set up the images and MathJax locally. A PDF just works.
HTML obviously supports dynamic embedding, such as programs, much better, but people usually just post a github.io page with the paper.
- devnull3 12 minutes ago
  
  > HTML rendering requires you to be connected to the internet
  Not really. One can always generate a self-contained html. Both CSS and JS (if needed) can be inline.
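A minimal sketch of what "self-contained" can mean in practice: the CSS and JS go inline, and binary assets such as images are embedded as `data:` URIs, so the file needs no network access to render. The helper names `data_uri` and `self_contained_page` are hypothetical and written only for this example; it assumes the caller supplies trusted `body_html`, `css`, and `js` strings.

```python
import base64
from html import escape


def data_uri(data, mime):
    """Encode raw bytes (for example an image) as a data: URI usable in src attributes."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def self_contained_page(title, body_html, css="", js=""):
    """Assemble a single HTML file with the stylesheet and script inlined."""
    script = f"<script>{js}</script>" if js else ""
    return (
        "<!DOCTYPE html>\n"
        f'<html lang="en"><head><meta charset="utf-8"><title>{escape(title)}</title>'
        f"<style>{css}</style></head>"
        f"<body>{body_html}{script}</body></html>"
    )


# Hypothetical usage: the three bytes stand in for real image data.
image = data_uri(b"abc", "image/png")
page = self_contained_page(
    "Example paper",
    f'<h1>Example paper</h1><img alt="figure" src="{image}">',
    css="h1 { font-family: serif; }",
)
print(page)
```

Math is the harder part: a page that relies on MathJax would need that script inlined too, or the math pre-rendered to MathML or SVG.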
- recursive 11 minutes ago
  
  Why would html rendering require a network connection? It doesn't seem to on my machine.
sega_sai an hour ago

Unfortunately I didn't see the recommendation there on what can be done for old papers. I checked, and only my papers after 2022 have an HTML version. I wish they'd make some kind of 'try html' button for those.
- sundarurfriend an hour ago
  
  Do the older papers work via [Ar5iv](https://ar5iv.labs.arxiv.org/) ?
  > View any arXiv article URL [in HTML] by changing the X to a 5
  The line
  > Sources upto the end of November 2025.
  sounds to me like this is indeed intended for older articles.
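The "change the X to a 5" trick can be sketched as a tiny URL rewrite. This assumes that swapping the host `arxiv.org` for `ar5iv.org` is all that is needed, which is my reading of the quoted line rather than a documented API; `to_ar5iv` is a hypothetical helper and the paper ID below is a placeholder, not a real article.

```python
from urllib.parse import urlsplit, urlunsplit


def to_ar5iv(url):
    """Rewrite an arxiv.org article URL to the corresponding ar5iv.org URL."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if host != "arxiv.org":
        raise ValueError(f"not an arxiv.org URL: {url}")
    # Assumption: only the host changes; path, query and fragment are kept.
    return urlunsplit(
        (parts.scheme, "ar5iv.org", parts.path, parts.query, parts.fragment)
    )


# Placeholder ID used purely for illustration.
print(to_ar5iv("https://arxiv.org/abs/0000.00000"))
```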
sundarurfriend an hour ago

[Sept 2023] as per the wayback machine.
jas39 an hour ago

Pandoc can convert to SVG, which can then be inlined in HTML. It looks just like LaTeX, though copy/paste isn't very useful.
nateroling an hour ago

Seeing the Gemini 3 capabilities, I can imagine a near future where file formats are effectively irrelevant.
- DANmode an hour ago
  
  Files.
  Truth in general, if we aren't careful.
ashleyn 2 hours ago

Can't help but wonder if this was motivated in part by people feeding papers into LLMs for summary, search, or review. PDF is awful for LLMs. You're effectively pigeonholed into using (PAYING for) Adobe's proprietary app and models which barely hold a candle to Gemini or Claude. There are PDF-to-text converters, but they often munge up the formatting.
- jrk an hour ago
  
  Not sure when you last tried, but Gemini, Claude, and ChatGPT have all supported pretty effective PDF input for quite a while.
lalithaar 2 hours ago

I was reading through this article too, glad to have found it on here
rootnod3 an hour ago

Maybe unpopular, but papers should be in a Markdown flavor, to be determined, just to make them more machine-readable.
- xigoi an hour ago
  
  Compared to HTML, Markdown is very bad at being machine-readable.