seirdy.one/scripts/xhtmlize-single-file.sh

#!/bin/sh
# copies an .html file to an equivalent .xhtml file, but replaces
# the meta charset with an XML declaration for compatibility with some
# XML tooling.
# Expects polyglot XHTML(5) markup.
# Formats both the .html and .xhtml file.
# this means that every index.html file has an equivalent index.xhtml file.
# content negotiation allows an agent to pick html or xhtml.

# use xmllint to do the formatting.
# xmllint ruins inline CSS so delete the inline CSS and re-insert it.
# xmllint also adds extra whitespace around <pre><code> which we remove
# with sed.
# This script also decreases indentation by one level.

#shellcheck disable=SC3040  # This only sets pipefail if it's available and otherwise does nothing
set -o pipefail 2>/dev/null || true
set -e -u

html_file="$1"
tmp_file="$(mktemp)"

cleanup() {
	rm -f "$tmp_file"
}
trap cleanup EXIT

# run_tidy() {
# 	tidy -asxhtml -config linter-configs/tidy.conf 2>/dev/null || true
# }

run_xmllint() {
	xmllint --format --encode UTF-8 --noent - || {
		# report the failing file on stderr; stdout is piped into the temp file
		echo "$html_file" >&2
		exit 1
	}
}

# delete the stylesheet (line 7) from the html file; we'll re-insert it later.
# tail drops the first line of xmllint's output (the XML declaration).
sed 7d "$html_file" | run_xmllint | tail -n +2 >"$tmp_file"
{
	head -n7 "$tmp_file" | sed -e 's/^\t//'
	cat "${OUTPUT_DIR:?}/tmp.xhtml"
	# shellcheck disable=SC2016 # these are regex statements, not shell expressions
	#shellcheck source=/home/rkumar/Executables/ghq/git.sr.ht/~seirdy/seirdy.one/scripts/xhtmlize.sh
	sed \
		-e '1,7d' \
		-e 's/^\t//' \
		-e 's|</span><span itemprop="familyName"|</span>\&#160;<span itemprop="familyName"|' \
		-e 's|class="u-photo photo"[^<]*<|class="u-photo photo"/> <|' \
		-e 's|<pre>|<pre tabindex="0">|' \
		-E \
		-e 's|([a-z])<data|\1 <data|' \
		-e 's#</span>(<a[^>]*rel="(nofollow ugc|ugc nofollow)"([^>]*)?>liked</a>)#</span> \1#' \
		-e 's#^[[:space:]]*<(code|/pre)#<\1#' \
		"$tmp_file" \
		| awk '/(^<\/code>|<pre tabindex="0">)/{printf "%s",$0;next}7'
} >"$html_file"