pdfgrep snippet
August 4, 2021
Problem:
Finding and sorting PDF files that contain particular words.
Going back through a large number of papers in PDF format and rereading them often takes a lot of time and concentration. With many other things to get done, it is often hard to remember which keywords appear in each paper. In other words, the task is to find which papers contain the needed keywords most often. One solution is to check the manual notes (log) kept for each paper read. But there is another approach that is relatively more practical for many purposes.
Solution:
- pdfgrep -irl "covid"
- -i Ignore case distinctions in both the PATTERN and the input files.
- -r Recursively search all files (restricted by --include and --exclude) under each directory, following symlinks only if they are on the command line.
- -l Suppress normal output. Instead print the name of each input file that contains a match. This works well with -Z, but many other output options like -n or -c are ignored when -l is specified.
- pdfgrep -irc "covid"
- -c Suppress normal output. Instead print the number of matches for each input file. Note that unlike grep, multiple matches on the same page will be counted individually.
- pdfgrep -Z -ilr "virtual" | xargs -0 pdfgrep -Hc remote
- -Z Output a null byte (called NUL in ASCII and '\0' in C) instead of the colon that usually separates a filename from the rest of the line. This option makes the output unambiguous in the presence of colons, spaces or newlines in the filename. It can be used in conjunction with commands such as xargs -0 or perl -0.
- -H Print the file name for each match. This is the default setting when there is more than one file to search.
- xargs -0 Input items are terminated by a null character instead of by whitespace, and the quotes and backslash are not special (every character is taken literally). Disables the end of file string, which is treated like any other argument. Useful when input items might contain white space, quote marks, or backslashes. The GNU find -print0 option produces input suitable for this mode.
- pdfgrep -rc -i "circuit.*simulation|simulation.*circuit"
- pdfgrep -rl -i "circuit.*simulation|simulation.*circuit"
- pdfgrep -Z -rli "covid" | xargs -0 pdfgrep -Hrci "cdio"
- pdfgrep -Z -rli "power electronics" | xargs -0 pdfgrep -Hrci "circuit"
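The role of -Z and xargs -0 in the pipelines above can be tried out without any PDFs. The following is a minimal sketch, not part of pdfgrep: printf stands in for the file list that pdfgrep -Z -l would emit, the file name is made up, and count_args is a hypothetical helper that only reports how many arguments xargs passes on.

```shell
#!/bin/sh
# count_args (hypothetical helper): print how many arguments xargs hands
# to the command it runs. Options such as -0 are passed through to xargs.
count_args() {
  xargs "$@" sh -c 'echo "$#"' sh
}

# Made-up file name containing a space.
name='power electronics.pdf'

# Newline-terminated name with default xargs: the name is split at the
# space, so the downstream command sees 2 arguments.
printf '%s\n' "$name" | count_args

# NUL-terminated name (the form pdfgrep -Z -l produces) with xargs -0:
# the name stays intact, so the downstream command sees 1 argument.
printf '%s\0' "$name" | count_args -0
```

This is why the second pdfgrep in each pipeline receives every file name whole, even when names contain spaces.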
Commands:
pdfgrep, grep, awk, sort
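The problem statement asks for sorting as well as searching, and the per-file counts printed by pdfgrep -c can be ranked with awk and sort. A minimal sketch, assuming the usual file:count line format; rank_counts is a hypothetical helper name, not something pdfgrep provides:

```shell
#!/bin/sh
# rank_counts (hypothetical helper): read "file:count" lines on stdin,
# drop files with zero matches, and print "count<TAB>file" with the
# highest count first.
rank_counts() {
  # The count is the last colon-separated field, so file names that
  # themselves contain colons are kept intact.
  awk -F: '$NF > 0 {
    print $NF "\t" substr($0, 1, length($0) - length($NF) - 1)
  }' | sort -k1,1nr
}

# Sample input in the assumed format (made-up file names and counts).
printf '%s\n' './a.pdf:3' './b c.pdf:0' './d:e.pdf:12' | rank_counts
```

With pdfgrep installed, something like pdfgrep -rci "covid" | rank_counts should list the PDFs under the current directory that mention the keyword, most matches first.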
Links:
- https://www.mankier.com/1/pdfgrep
- https://pdfgrep.org/doc.html
- https://newbedev.com/how-can-i-use-0-option-to-xargs-when-specifying-the-input-manually