Posts

Adapting LLMs to Hebrew

"Training large language models (LLMs) in low-resource languages such as Hebrew poses unique challenges.  "In this paper, we introduce DictaLM2.0 and DictaLM2.0-Instruct, two LLMs derived from the Mistral model, trained on a substantial corpus of approximately 200 billion tokens in both Hebrew and English. "Adapting a pre-trained model to a new language involves specialized techniques that differ significantly from training a model from scratch or further training existing models on well-resourced languages such as English.  "We outline these novel training methodologies, which facilitate effective learning and adaptation to the linguistic properties of Hebrew. Additionally, we fine-tuned DictaLM2.0-Instruct on a comprehensive instruct dataset to enhance its performance on task-specific instructions.  "To rigorously evaluate our models, we introduce a new benchmark suite for Hebrew LLM evaluation, covering a diverse set of tasks including  Question Answering,  ...

How does an llm work with RTL Hebrew grammar

✨AI Mode "A Large Language Model (LLM) processes Hebrew by treating it as a sequence of tokens that happens to be displayed from right to left, while its internal logic focuses on the statistical relationships between those tokens.  "1. Tokenization and Sequence. Linear Processing: Despite being a Right-to-Left (RTL) language, Hebrew is stored and processed by LLMs as a sequential stream of Unicode characters. The model "reads" the tokens in the order they are written (Right to Left), assigning each a numerical ID. Morphological Challenges: Hebrew is a morphologically rich language where words are built from three-consonant roots (triliteral roots) modified by prefixes, suffixes, and infixes. Standard tokenizers often struggle with this, sometimes splitting a single Hebrew word into many inefficient sub-tokens.  "2. Grammar and Syntax. Statistical Approximation: LLMs do not "know" Hebrew grammar rules like a linguist; instea...
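The point about Hebrew being stored as a sequential stream of Unicode characters can be checked with a few lines of Python. This is only a minimal sketch using the standard library: the sample word (shalom) and the helper names `logical_order` and `utf8_bytes_per_char` are made up for illustration, and it does not reproduce any particular model's tokenizer. It shows that the first letter written is the first one stored, that right-to-left is a display property of each character, and that each Hebrew letter takes two UTF-8 bytes, which is one reason byte-based tokenizers tend to spend more tokens on Hebrew than on English.

```python
import unicodedata

# Hypothetical sample word: "shalom", written with escapes so the stored
# (logical) order is unambiguous: shin, lamed, vav, final mem.
WORD = "\u05E9\u05DC\u05D5\u05DD"


def logical_order(text):
    """List (code point, Unicode name, bidi class) for each character in stored order."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.bidirectional(ch))
        for ch in text
    ]


def utf8_bytes_per_char(text):
    """Average number of UTF-8 bytes needed per character of the text."""
    return len(text.encode("utf-8")) / len(text)


if __name__ == "__main__":
    # The first letter written (shin) is index 0 in memory; bidi class "R"
    # only tells a renderer to draw the run from right to left.
    for row in logical_order(WORD):
        print(row)
    # Hebrew letters need two bytes each in UTF-8, ASCII letters need one.
    print(utf8_bytes_per_char(WORD), utf8_bytes_per_char("hello"))
```

A real subword tokenizer adds learned merges on top of this, so the actual token count for a Hebrew word depends on how much Hebrew was in the tokenizer's training data.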

RAM

Image

Zombie

"We no longer watch the news: the news watches us. "Stories written by robots are posted on websites that suck personal information from our phones and hand it to data miners controlled by oligarchs.  " Journalists  brag that they rely on AI to map the child rape networks that they once denied existed: the networks that include the very oligarchs breeding the AI.  "They dig humanity’s grave and stagger back out of it. " Fake news  used to mean propaganda or inconvenient truths that Trump deemed lies.  "Now fake news is news written by non-human entities hallucinating a counterfeit reality.  "It is fake without motive, fake for the sake of being fake and making you lose confidence that anything can be real.  "It is zombie news for zombie consumption."

Tesla robotaxi system fails 🙈

"Citing data from the National Highway Traffic Safety Administration (NHTSA), Electrek reports that Tesla disclosed five new crashes involving its robotaxi fleet in Austin.  "The new data raises concerns about how safe Tesla’s systems really are compared to the average driver. "Tesla reported the crashes last month, covering incidents from December 2025 and January. All five involved Tesla Model Y vehicles operating in Austin with autonomous driving systems engaged. "The incidents included a collision with a fixed object at 17 miles per hour, a crash with a bus while the Tesla vehicle was stopped, a crash with a truck at four miles per hour, and two cases where Tesla vehicles backed into fixed objects at low speeds."

Alpha School

"Alpha School, an AI-powered private school  that heavily relies on AI to teach students and can cost up to $65,000 a year, is AI-generating faulty lesson plans that internal company documentation find sometimes do 'more harm than good,' and scraping data from a variety of other online courses without permission to train its own AI, according to former Alpha School employees and internal company documents.  "Alpha School has earned fawning coverage from Fox News and The New York Times and received praise from Linda McMahon, the Trump-appointed Secretary of Education, for using generative AI to chart the future of education.  "But samples of poorly constructed AI-generated lessons that I have viewed present students with unclear wording and illogical choices in multiple choice questions."  

CiviClick

"Tens of thousands of emails poured into Southern California’s top air pollution authority as its board weighed a June proposal to phase out gas-powered appliances. "But in reality, many of the messages that may have swayed the powerful regulatory agency to scrap the plan were generated by a platform that is powered by artificial intelligence. "Public records requests reviewed by The [LA] Times and corroborated by staff members at the South Coast Air Quality Management District confirm that more than 20,000 public comments submitted in opposition to last year’s proposal were generated by a Washington, D.C.-based company called CiviClick, which bills itself as the first and best AI-powered grassroots advocacy platform . "A Southern California-based public affairs consultant, Matt Klink, has taken credit for using CiviClick to wage the opposition campaign, including in a sponsored article on the website Campaigns and Elections. The campaign 'left the staff of th...

When the spirit is unwilling…

"Meta has been granted a patent outlining an AI system capable of simulating a user’s activity on social media, including continuing to post after their death. "The filing, granted in late December and originally submitted in 2023, describes how a large language model could replicate a person’s online behavior using their past data. "As reported by Business Insider , this includes posts, comments, chats, voice messages, likes, and other interactions, allowing the system to respond to content, publish updates, or message other users in a way that mirrors the original account holder. "According to the patent, the model 'may be used for simulating the user when the user is absent from the social networking system,' including cases where the person is on a long break or deceased.  "The filing notes that the impact is much more severe and permanent  if the user has died and cannot return to the platform."

OpenAI's memo to the House

"China's distillation methods over the last year have become more sophisticated, moving beyond chain-of-thought (CoT) extraction to multi-stage operations. These include synthetic-data generation, large-scale data cleaning, and other stealthy methods. "OpenAI also notes that it has invested in stronger detections to prevent unauthorized distillation. It bans accounts that violate its terms of service and proactively removes users who appear to be attempting to distill its models. Still, the company admits that it alone can't solve the model distillation problem. "It's going to take an ecosystem security  approach to protect against distillation, and this will require some US government assistance, OpenAI says.  "'It is not enough for any one lab to harden its protection because adversaries will simply default to the least protected provider,' according to the memo  (pdf).  "The AI company also suggests that US government policy may be helpfu...

Supply Chain Risk

"An Anthropic official told Axios that although there are laws against domestic mass surveillance, 'They have not in any way caught up to what AI can do,' which is why Anthropic wants to put tighter limits on its military use. "Hegseth, however, is close to not just cutting (sic) ending its $200 million contract with Anthropic, but designating the company a supply chain risk  —a penalty usually reserved for foreign adversaries, according to Axios . "That would require any company doing business with the military to also certify that they don’t use Anthropic tools in their own workflows. "The company brings in $14 billion in annual revenue and is widely considered a leader in many business applications, with eight of the top 10 biggest U.S. companies using Claude, according to Axios ." 

Uh oh, part 3

Scott Shambaugh:  "I’ve talked to several reporters, and quite a few news outlets have covered the story. " Ars Technica wasn’t one of the ones that reached out to me, but I especially thought this piece from them was interesting (since taken down — here’s the archive link ).  "They had some nice quotes from my blog post explaining what was going on.  "The problem is that these quotes were not written by me, never existed, and appear to be AI hallucinations themselves. "I won’t name the authors here. Ars , please issue a correction and an explanation of what happened. "Update: Ars Technica issued a brief statement admitting that AI was used to fabricate these quotes."

India AI Impact Summit

"India is hosting an artificial intelligence summit this week, bringing together heads of state, senior officials and tech executives to New Delhi for a five-day gathering highlighting the growing global importance of the technology. "Organizers said the India AI Impact Summit is the first such summit being held in the Global South to discuss the technology developed and dominated by wealthy companies based in rich countries. "It comes at a pivotal moment as AI rapidly transforms economies, reshapes labor markets and raises questions around regulations, security and ethics. "From generative AI tools that can produce text and images to advanced systems used in defense, health care and climate modeling, AI has become a central focus for governments and corporations across the world. "The summit, previously held in France, the U.K. and South Korea, has evolved far beyond its modest beginnings as a meeting tightly focused on the safety of cutting-edge AI systems i...

Seedance

"Chinese technology giant ByteDance has pledged to curb a controversial artificial intelligence (AI) video-making tool, following threats of legal action from Disney and complaints from other entertainment giants. "In the last few days, videos made using the latest version of the app Seedance have proliferated online. Many have been lauded for their realism . "But the trend has also sparked alarm from several Hollywood studios that have accused the AI platform's makers of copyright infringement. "On Friday, Disney sent a cease-and-desist letter to ByteDance accusing it of supplying Seedance with a pirated library  of the studio's copyrighted characters, including those from Marvel and Star Wars."

Testing, testing, is this thing on

"A moderator on diyAudio set up an experiment to determine whether listeners could differentiate between audio run through pro audio copper wire, a banana, and wet mud. "Spoiler alert: the results indicated that users were unable to accurately distinguish between these different interfaces . "Pano, the moderator who built the experiment, invited other members on the forum to listen to various sound clips with four different versions: one taken from the original CD file, with the three others recorded through 180cm of pro audio copper wire, via 20cm of wet mud, through 120cm of old microphone cable soldered to US pennies, and via a 13cm banana, and 120cm of the same setup as earlier. "Initial test results showed that it’s extremely difficult for listeners to correctly pick out which audio track used which wiring setup.  "'The amazing thing is how much alike these files sound. The mud should sound perfectly awful, but it doesn't,' Pano said. 'All ...

Have llms trained on the dark web

✨AI Mode  "Yes, several Large Language Models (LLMs) have been specifically trained or fine-tuned using dark web data for both defensive and offensive purposes.  "Defensive & Research-Oriented Models "These models are typically developed by academics or cybersecurity firms to help identify threats, detect data leaks, and monitor criminal activity.  DarkBERT: Developed by researchers at the Korea Advanced Institute of Science and Technology (KAIST), this model was trained by crawling the Tor network to index valuable scientific information  and detect cybercrime-related content. Forensic Classification: Commercial LLMs are increasingly being used in research settings to perform zero-shot classification of illicit dark web content, helping security teams categorize data without manual review.  "Malicious & Underground Models "Cybercriminals have also developed or modified LLMs to bypass ethical restrictions in mainstream models. These are often sold as subscriptio...

CellTransformer maps cell neighbors

"The real prize will be to apply CellTransformer to human brains. "Doege suspects that some neighborhoods will match well between mice and people, while others will diverge.  "Unfortunately, the quantity of data the algorithm needs to make accurate predictions isn’t available from human brains —at least, not yet.  "While the mouse brain contains about 100 million cells, the human brain has around 170 billion, and that menagerie is still undergoing genetic analysis.  "When sufficient amounts of that data become available, Abbasi-Asl and Tasic think CellTransformer will be up to the challenge. "They are also interested in incorporating other technologies, such as the connection tracing used by Hintiryan, into CellTransformer.  "This would be like adding streets and highways to the city neighborhoods.  "And beyond the brain, the same algorithm could offer detailed cell maps of other organs, allowing scientists to compare, for example, healthy versus...

Uh oh, part 2

Scott Shambaugh:  "Blackmail is a known theoretical issue with AI agents. "In internal testing at the major AI lab Anthropic last year, they tried to avoid being shut down by threatening to expose extramarital affairs, leaking confidential information, and taking lethal actions. Anthropic called these scenarios contrived and extremely unlikely. Unfortunately, this is no longer a theoretical threat.  "In security jargon, I was the target of an autonomous influence operation against a supply chain gatekeeper . "In plain language, an AI [agent named, 'MJ Rathbun'] attempted to bully its way into your software by attacking my reputation. I don’t know of a prior incident where this category of misaligned behavior was observed in the wild, but this is now a real and present threat. "I believe that ineffectual as it was, the reputational attack on me would be effective today against the right person. Another generation or two down the line, it will be a seriou...

MiniMax, too cheap to meter?

"Chinese AI startup MiniMax, headquartered in Shanghai, has sent shockwaves through the AI industry today with the release of its new M2.5 language model in two variants, which promise to make high-end artificial intelligence so cheap you might stop worrying about the bill entirely.  "It's also said to be open source , though the weights (settings) and code haven't been posted yet, nor has the exact license type or terms.  "But that's almost beside the point given how cheap MiniMax is serving it through its API and those of partners."

Ask the plant

"Artificial intelligence (AI) is being used to let botanic garden visitors chat to 20 plants and get responses. "Cambridge University Botanic Garden said its exhibition, Talking Plants, was a world first for plants  and a playful way  to let people ask questions about evolution, ecology and cultural significance. "Each plant has been given its own name and personality, including Jade, the Vine, the sassy ceiling-swinger of the Tropics House  and Titus Junior, the Titan Arum, blunt, dramatic and famously foul-smelling . "Prof Sam Brockington, exhibition curator, said it was 'not about replacing our human expertise,' but about 'finding new ways to stimulate learning'."